Práctica 1. Aprendizaje Automático¶

Authors: Carlos Iborra Llopis (100451170), Alejandra Galán Arrospide (100451273)
For additional notes and requirements: https://github.com/carlosiborra/Grupo02-Practica1-AprendizajeAutomatico

❗If you are willing to run the code yourself, please clone the full GitHub repository, as it contains the necessary folder structures to export images and results❗

0. Table of contents¶

  • Práctica 1. Aprendizaje Automático
    • 0. Table of contents
    • 1. Requirements
    • 2. Reading the datasets
    • 3. Exploratory Data Analysis
    • 4. Train-Test division
    • 5. Basic Methods
    • 6. Reducing Dimensionality
    • 7. Advanced methods
    • 8. Best model
    • 9. Final Conclusions

1. Requirements¶

In [69]:
""" Importing necessary libraries """
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import missingno as msno
import seaborn as sns
import scipy.stats as st
import scipy
import sklearn

from matplotlib.cbook import boxplot_stats as bps

1.1. Cleaning ../data/img/ folder¶

This way we avoid creating multiple images and sending the old ones to the trash.
Also using this to upload cleaner commits to GitHub.

In [70]:
""" Cleaning the ../data/img/ folder """
import os
import glob

files = glob.glob("../data/img/*")
for f in files:
    if os.path.isfile(f) and f.endswith(".png"):
        os.remove(f)

files = glob.glob("../data/img/box-plot/*")
for f in files:
    if os.path.isfile(f) and f.endswith(".png"):
        os.remove(f)

2. Reading the datasets¶

Reading the datasets from the bz2 files, group 2.

In [71]:
""" Reading the dataset """
disp_df = pd.read_csv("../data/disp_st2ns1.txt.bz2", compression="bz2", index_col=0)
comp_df = pd.read_csv("../data/comp_st2ns1.txt.bz2", compression="bz2", index_col=0)

3. EDA¶

Key Concepts of Exploratory Data Analysis

  • 2 types of Data Analysis
    • Confirmatory Data Analysis
    • Exploratory Data Analysis
  • 4 Objectives of EDA
    • Discover Patterns
    • Spot Anomalies
    • Frame Hypothesis
    • Check Assumptions
  • 2 methods for exploration
    • Univariate Analysis
    • Bivariate Analysis
  • Stuff done during EDA
    • Trends
    • Distribution
    • Mean
    • Median
    • Outlier
    • Spread measurement (SD)
    • Correlations
    • Hypothesis testing
    • Visual Exploration

3.0. Dataset preparation¶

To conduct exploratory data analysis (EDA) on our real data, we need to prepare the data first. Therefore, we have decided to separate the data into training and test sets at an early stage to avoid data leakage, which could result in an overly optimistic evaluation of the model, among other consequences. This separation will be done by dividing the data prematurely into training and test sets since potential data leakage can occur from the usage of the test partition, especially when including the result variable.

It is important to note that this step is necessary because all the information obtained in this section will be used to make decisions such as dimensionality reduction. Furthermore, this approach makes the evaluation more realistic and rigorous since the test set is not used until the end of the process.

In [72]:
""" Train Test Split (time series) """

# * Make a copy of the dataframe (as Padas dataframe is mutable, therefore uses a reference)
disp_df_copy = disp_df.copy()

# print(disp_df)
# print(disp_df_copy)

# Now we make the train_x, train_y, test_x, test_y splits taking into account the time series
# Note: the time series is ordered by date, therefore we need to split the data in a way that the train data is before the test data
# Note: the 10 first years are used for training and the last two years for testing
# Note: this is done because if not, we will be predicting the past from the future, which leads to errors and overfitting (data leakage) in the model

# * Calculate the number of rows for training and testing
num_rows = disp_df_copy.shape[0]
num_train_rows = int(
    num_rows * 10 / 12
)  # 10 first years for training, 2 last years for testing

# * Split the data into train and test dataframes (using iloc instead of train_test_split as it picks random rows)
train_df = disp_df_copy.iloc[
    :num_train_rows, :
]  # train contains the first 10 years of rows
test_df = disp_df_copy.iloc[
    num_train_rows:, :
]  # test contains the last 2 years of rows

# Print the number of rows for each dataframe
print(f"Number of rows for training (EDA): {train_df.shape[0]}")
print(f"Number of rows for testing: {test_df.shape[0]}")


# ! We maintain the original dataframe for later use (as we will divide it into train and test dataframes below)
# ! For the EDA, we will use the train_df dataframe (with the outpout variable).
Number of rows for training (EDA): 3650
Number of rows for testing: 730

3.1. Dataset description¶

  • apcp_sfc: 3-Hour accumulated precipitation at the surface (kg·m⁽⁻²⁾)
  • dlwrf_sfc: Downward long-wave radiative flux average at the surface (W·m⁽⁻²⁾)
  • dswrf_sfc: Downward short-wave radiative flux average at the surface (W·m⁽⁻²⁾)
  • pres_msl: Air pressure at mean sea level (Pa)
  • pwat_eatm: Precipitable Water over the entire depth of the atmosphere (kg·m⁽⁻²⁾)
  • spfh_2m: Specific Humidity at 2 m above ground (kg·kg⁽⁻¹⁾)
  • tcdc_eatm: Total cloud cover over the entire depth of the atmosphere (%)
  • tcolc_eatm: Total column-integrated condensate over the entire atmos. (kg·m⁽⁻²⁾)
  • tmax_2m: Maximum Temperature over the past 3 hours at 2 m above the ground (K)
  • tmin_2m: Mininmum Temperature over the past 3 hours at 2 m above the ground (K)
  • tmp_2m: Current temperature at 2 m above the ground (K)
  • tmp_sfc: Temperature of the surface (K)
  • ulwrf_sfc: Upward long-wave radiation at the surface (W·m⁽⁻²⁾)
  • ulwrf_tatm: Upward long-wave radiation at the top of the atmosphere (W·m⁽⁻²⁾)
  • uswrf_sfc: Upward short-wave radiation at the surface (W·m⁽⁻²⁾)
In [73]:
# Display all the columns of the dataframe
pd.set_option("display.max_columns", None)

train_df.describe()
Out[73]:
apcp_sf1_1 apcp_sf2_1 apcp_sf3_1 apcp_sf4_1 apcp_sf5_1 dlwrf_s1_1 dlwrf_s2_1 dlwrf_s3_1 dlwrf_s4_1 dlwrf_s5_1 dswrf_s1_1 dswrf_s2_1 dswrf_s3_1 dswrf_s4_1 dswrf_s5_1 pres_ms1_1 pres_ms2_1 pres_ms3_1 pres_ms4_1 pres_ms5_1 pwat_ea1_1 pwat_ea2_1 pwat_ea3_1 pwat_ea4_1 pwat_ea5_1 spfh_2m1_1 spfh_2m2_1 spfh_2m3_1 spfh_2m4_1 spfh_2m5_1 tcdc_ea1_1 tcdc_ea2_1 tcdc_ea3_1 tcdc_ea4_1 tcdc_ea5_1 tcolc_e1_1 tcolc_e2_1 tcolc_e3_1 tcolc_e4_1 tcolc_e5_1 tmax_2m1_1 tmax_2m2_1 tmax_2m3_1 tmax_2m4_1 tmax_2m5_1 tmin_2m1_1 tmin_2m2_1 tmin_2m3_1 tmin_2m4_1 tmin_2m5_1 tmp_2m_1_1 tmp_2m_2_1 tmp_2m_3_1 tmp_2m_4_1 tmp_2m_5_1 tmp_sfc1_1 tmp_sfc2_1 tmp_sfc3_1 tmp_sfc4_1 tmp_sfc5_1 ulwrf_s1_1 ulwrf_s2_1 ulwrf_s3_1 ulwrf_s4_1 ulwrf_s5_1 ulwrf_t1_1 ulwrf_t2_1 ulwrf_t3_1 ulwrf_t4_1 ulwrf_t5_1 uswrf_s1_1 uswrf_s2_1 uswrf_s3_1 uswrf_s4_1 uswrf_s5_1 salida
count 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3.650000e+03
mean 0.610222 0.251049 0.479367 0.279969 0.525625 316.590458 316.996492 324.225574 343.169304 342.582550 0.074371 163.928966 376.718929 686.534869 508.429988 101718.580471 101774.517076 101743.013770 101538.253073 101499.397514 21.394485 21.536129 22.127195 22.595594 22.384870 0.007844 0.008848 0.009356 0.009473 0.009918 0.069240 0.067845 0.064862 0.065706 0.062366 0.069539 0.068172 0.065166 0.066036 0.062748 286.950030 288.292227 292.803749 294.483694 294.542492 284.595935 284.638684 284.617400 292.733513 291.084714 284.846286 288.227387 292.740802 294.299550 291.301035 284.094056 289.230769 295.533258 295.904819 290.366407 375.991521 381.989673 400.742449 439.104661 431.318749 247.736467 247.626828 251.950057 262.207928 261.074238 0.078107 38.716712 76.394795 127.098207 99.476613 1.638200e+07
std 2.245850 0.994112 1.756408 1.120933 1.931408 56.119896 58.129352 58.941747 61.150202 61.027007 0.305126 112.645372 159.486316 227.642854 193.753483 725.206610 731.500969 720.701217 699.477989 715.361146 12.256253 12.358856 12.583364 12.633154 12.401121 0.004398 0.005039 0.005175 0.005097 0.005456 0.167104 0.169653 0.171287 0.172516 0.166113 0.166989 0.169522 0.171172 0.172385 0.165958 8.925065 9.743169 9.898253 9.789117 9.776615 8.735982 8.862301 8.866503 9.950300 10.099684 8.722593 9.795209 9.944761 9.795537 10.083859 8.861650 9.756852 9.148308 9.317363 10.462108 46.586515 49.914820 50.766618 53.159310 54.417631 36.270918 36.289003 35.798277 38.698726 38.427066 0.258752 26.010130 30.743175 40.765618 35.505727 8.059674e+06
min 0.000000 0.000000 0.000000 0.000000 0.000000 158.971770 160.032903 165.524543 183.671312 186.342961 0.000000 0.000000 20.000000 30.000000 20.000000 99316.970881 99315.887074 99327.755682 99040.100852 98830.153409 1.100000 1.314819 1.107352 1.142803 1.201246 0.000462 0.000485 0.000451 0.000478 0.000468 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 254.589220 254.937418 258.549777 260.800365 260.863475 251.941358 249.576132 249.576714 258.698331 258.171345 251.942065 254.844406 258.552646 260.795430 258.170049 250.100794 256.360800 263.634377 264.533564 256.520408 229.296161 223.985486 246.314349 278.576630 271.707606 104.671267 113.559602 118.679132 119.393449 121.951425 0.000000 0.000000 3.181818 4.363636 2.545455 5.100000e+05
25% 0.000000 0.000000 0.000000 0.000000 0.000000 270.043573 267.583016 275.008281 292.786299 291.777096 0.000000 52.727273 240.000000 525.454545 344.477273 101266.472124 101311.399680 101283.033381 101102.175426 101049.033203 10.879000 10.718024 11.122964 11.558385 11.559638 0.003991 0.004229 0.004617 0.004736 0.004754 0.000000 0.000000 0.000000 0.000000 0.000000 0.000618 0.000564 0.000545 0.000673 0.000727 280.127903 280.413352 284.916120 286.996939 287.043590 277.844468 277.666665 277.653641 284.828555 283.228638 278.075898 280.285865 284.764714 286.841753 283.495367 277.025516 281.298508 288.689326 288.982901 281.922856 338.180517 340.208757 358.706337 398.061333 388.214025 230.536257 230.759227 234.398558 246.594118 244.431134 0.000000 14.000000 53.818182 108.818182 74.909091 1.061385e+07
50% 0.000000 0.000000 0.000000 0.000000 0.000000 319.801794 321.400251 328.456741 345.402277 345.107513 0.000000 150.000000 384.318182 730.000000 525.636364 101645.975852 101704.351207 101674.750000 101472.419389 101425.127131 19.191209 19.163636 19.650000 20.290909 20.194252 0.007246 0.008248 0.008909 0.009156 0.009518 0.004545 0.004545 0.003636 0.003636 0.002727 0.005118 0.004855 0.004145 0.004345 0.003673 287.597683 289.039583 293.757317 295.529044 295.598358 285.070623 285.192374 285.106049 293.606805 291.806785 285.381022 288.990954 293.704959 295.310534 292.050580 284.581711 290.050444 296.204518 296.669453 291.126846 376.267101 382.791372 401.524648 440.987373 433.520339 253.350231 253.394166 257.342928 270.790095 269.287814 0.000000 35.500000 79.636364 136.636364 105.454545 1.638195e+07
75% 0.114545 0.051818 0.121591 0.033636 0.090000 367.134144 370.342597 378.683015 399.545104 398.891589 0.000000 264.454545 524.636364 893.636364 693.681818 102131.380504 102188.719283 102148.666726 101940.005327 101919.873402 31.188882 31.471632 32.439831 33.103788 32.459132 0.011612 0.013523 0.014275 0.014169 0.015062 0.055455 0.056364 0.042727 0.043636 0.038182 0.056136 0.056918 0.042900 0.043523 0.038843 294.329548 296.940075 301.377123 302.732298 302.753016 292.290495 292.587547 292.576612 301.351051 299.909759 292.551771 296.945783 301.346064 302.621007 300.068969 292.110791 297.882618 303.161016 303.555412 299.639197 416.508387 427.792698 445.458675 482.899051 476.327880 274.309069 274.750930 278.752231 289.945510 289.588822 0.000000 62.000000 103.068182 155.454545 129.727273 2.329185e+07
max 34.428182 16.846364 28.399091 26.381818 36.875455 426.173970 427.486894 429.693146 455.566337 453.910406 3.000000 381.818182 642.181818 990.000000 791.090909 104688.396307 104856.285511 104693.185369 104244.932528 104249.968040 60.327273 58.876881 59.915362 59.309182 60.529133 0.018809 0.019533 0.020985 0.021932 0.023318 1.920909 2.370000 2.449091 2.146364 1.957273 1.920136 2.369282 2.450482 2.146409 1.956655 304.480122 304.792880 311.277519 312.660564 312.668726 300.350930 299.724509 299.735546 310.815957 308.761763 300.344230 304.773410 311.272270 312.595520 308.827304 299.869093 306.834309 315.964081 313.965757 308.270147 470.753102 469.429213 504.584351 555.704024 542.529280 318.245345 311.991660 315.569164 328.920274 327.253141 1.000000 92.272727 192.636364 450.636364 313.909091 3.122700e+07
In [74]:
train_df.shape
Out[74]:
(3650, 76)
In [75]:
train_df.head()
Out[75]:
apcp_sf1_1 apcp_sf2_1 apcp_sf3_1 apcp_sf4_1 apcp_sf5_1 dlwrf_s1_1 dlwrf_s2_1 dlwrf_s3_1 dlwrf_s4_1 dlwrf_s5_1 dswrf_s1_1 dswrf_s2_1 dswrf_s3_1 dswrf_s4_1 dswrf_s5_1 pres_ms1_1 pres_ms2_1 pres_ms3_1 pres_ms4_1 pres_ms5_1 pwat_ea1_1 pwat_ea2_1 pwat_ea3_1 pwat_ea4_1 pwat_ea5_1 spfh_2m1_1 spfh_2m2_1 spfh_2m3_1 spfh_2m4_1 spfh_2m5_1 tcdc_ea1_1 tcdc_ea2_1 tcdc_ea3_1 tcdc_ea4_1 tcdc_ea5_1 tcolc_e1_1 tcolc_e2_1 tcolc_e3_1 tcolc_e4_1 tcolc_e5_1 tmax_2m1_1 tmax_2m2_1 tmax_2m3_1 tmax_2m4_1 tmax_2m5_1 tmin_2m1_1 tmin_2m2_1 tmin_2m3_1 tmin_2m4_1 tmin_2m5_1 tmp_2m_1_1 tmp_2m_2_1 tmp_2m_3_1 tmp_2m_4_1 tmp_2m_5_1 tmp_sfc1_1 tmp_sfc2_1 tmp_sfc3_1 tmp_sfc4_1 tmp_sfc5_1 ulwrf_s1_1 ulwrf_s2_1 ulwrf_s3_1 ulwrf_s4_1 ulwrf_s5_1 ulwrf_t1_1 ulwrf_t2_1 ulwrf_t3_1 ulwrf_t4_1 ulwrf_t5_1 uswrf_s1_1 uswrf_s2_1 uswrf_s3_1 uswrf_s4_1 uswrf_s5_1 salida
V1 0.0 0.0 0.0 0.000000 0.0 268.583582 244.241641 251.174486 269.741308 268.377441 0.0 30.0 220.000000 510.000000 330.000000 101832.056108 102053.159091 102090.046165 101934.175426 101988.003551 5.879193 7.018182 8.460800 9.418182 9.727869 0.003229 0.002993 0.003775 0.003870 0.003855 0.000000 0.000000 0.000000 0.000000 0.000909 0.000818 0.000264 0.000255 0.000500 0.002218 280.789784 279.627444 285.727761 286.881681 286.885823 279.198020 278.472615 278.474720 285.799685 280.966961 279.249256 279.612202 285.742784 286.841053 280.960865 277.278370 279.250383 288.826760 288.596086 278.500078 341.122231 335.067918 354.626126 397.774053 383.281225 222.153166 252.504475 254.760271 263.342404 260.067843 0.0 10.000000 50.000000 106.636364 72.000000 11930700
V2 0.0 0.0 0.0 0.008182 0.2 251.725869 255.824126 272.163913 318.259924 307.929083 0.0 30.0 173.636364 333.636364 224.545455 101425.883523 101284.509233 101253.654830 100999.313920 101424.626420 12.534339 11.987316 12.159355 12.313590 13.469729 0.003737 0.003931 0.004015 0.003994 0.004826 0.037273 0.021818 0.101818 0.084545 0.109091 0.037155 0.021309 0.102373 0.085827 0.109336 278.822329 278.063379 283.618583 286.606684 286.643397 277.258919 276.740628 276.740628 283.687009 282.111078 277.282621 278.070390 283.604600 286.554729 282.105011 275.830009 278.269459 287.048970 287.325478 281.005252 330.159915 329.354673 347.524819 388.017767 378.773804 236.836691 233.458263 233.027276 212.652054 222.052916 0.0 8.181818 35.909091 58.181818 42.090909 9778500
V3 0.0 0.0 0.0 0.000000 0.0 219.734547 211.996022 216.405820 235.529123 239.840132 0.0 30.0 220.000000 523.636364 337.545455 102253.654119 102301.918324 102088.093750 101652.815341 101543.146307 5.726770 5.458528 5.700000 7.163636 9.536364 0.002003 0.001919 0.002107 0.002431 0.002583 0.000000 0.000000 0.007273 0.007273 0.042727 0.001427 0.001582 0.007309 0.006973 0.042127 275.400091 270.222512 275.885787 279.049513 279.381653 269.756037 269.157731 269.156439 276.041792 275.301960 269.766876 270.204285 275.880818 279.064603 275.806757 269.533059 271.690993 281.759993 282.686446 273.615503 309.639845 299.751961 317.250763 364.339136 351.496665 238.655654 232.828737 235.480750 245.177331 238.893102 0.0 10.272727 55.272727 118.454545 79.181818 9771900
V4 0.0 0.0 0.0 0.000000 0.0 253.499410 230.896544 235.857221 240.274556 237.804048 0.0 30.0 208.181818 512.727273 337.181818 102110.375710 102435.603693 102688.528409 102588.876420 102598.252841 7.889904 6.768959 6.208357 5.977267 6.411838 0.002918 0.002735 0.002771 0.002821 0.002738 0.000000 0.002727 0.005455 0.000909 0.012727 0.000473 0.004018 0.007300 0.001600 0.014882 279.396046 276.176919 276.868630 278.550368 278.572038 276.175482 273.839142 273.840535 276.942990 273.802970 276.312428 274.045715 276.877749 278.571555 273.812827 274.824765 274.466433 281.291418 281.871679 272.191753 330.310971 318.761563 329.305478 360.297788 348.618319 236.784869 241.916776 243.398572 251.473036 247.503769 0.0 8.909091 46.000000 107.090909 73.636364 6466800
V5 0.0 0.0 0.0 0.000000 0.0 234.890020 238.927051 246.850822 271.577246 275.572826 0.0 30.0 220.000000 517.272727 336.363636 101750.317472 101331.333807 100921.029119 100422.514915 100309.059659 10.783448 10.425542 10.362327 8.829511 9.647615 0.003274 0.003269 0.003066 0.003483 0.003788 0.000909 0.000909 0.000909 0.014545 0.050909 0.001673 0.001836 0.001373 0.015909 0.049591 273.294803 275.018022 283.542744 288.171156 288.265137 272.858415 273.303902 273.306355 283.734819 283.735446 273.314844 274.990234 283.563099 288.178922 285.567946 272.260426 275.132668 285.698725 288.490562 283.121391 310.023179 314.763264 334.042186 388.737835 383.409776 233.641681 233.706659 239.952805 258.128188 253.200684 0.0 8.909091 48.909091 106.272727 71.818182 11545200
In [76]:
train_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 3650 entries, V1 to V3650
Data columns (total 76 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   apcp_sf1_1  3650 non-null   float64
 1   apcp_sf2_1  3650 non-null   float64
 2   apcp_sf3_1  3650 non-null   float64
 3   apcp_sf4_1  3650 non-null   float64
 4   apcp_sf5_1  3650 non-null   float64
 5   dlwrf_s1_1  3650 non-null   float64
 6   dlwrf_s2_1  3650 non-null   float64
 7   dlwrf_s3_1  3650 non-null   float64
 8   dlwrf_s4_1  3650 non-null   float64
 9   dlwrf_s5_1  3650 non-null   float64
 10  dswrf_s1_1  3650 non-null   float64
 11  dswrf_s2_1  3650 non-null   float64
 12  dswrf_s3_1  3650 non-null   float64
 13  dswrf_s4_1  3650 non-null   float64
 14  dswrf_s5_1  3650 non-null   float64
 15  pres_ms1_1  3650 non-null   float64
 16  pres_ms2_1  3650 non-null   float64
 17  pres_ms3_1  3650 non-null   float64
 18  pres_ms4_1  3650 non-null   float64
 19  pres_ms5_1  3650 non-null   float64
 20  pwat_ea1_1  3650 non-null   float64
 21  pwat_ea2_1  3650 non-null   float64
 22  pwat_ea3_1  3650 non-null   float64
 23  pwat_ea4_1  3650 non-null   float64
 24  pwat_ea5_1  3650 non-null   float64
 25  spfh_2m1_1  3650 non-null   float64
 26  spfh_2m2_1  3650 non-null   float64
 27  spfh_2m3_1  3650 non-null   float64
 28  spfh_2m4_1  3650 non-null   float64
 29  spfh_2m5_1  3650 non-null   float64
 30  tcdc_ea1_1  3650 non-null   float64
 31  tcdc_ea2_1  3650 non-null   float64
 32  tcdc_ea3_1  3650 non-null   float64
 33  tcdc_ea4_1  3650 non-null   float64
 34  tcdc_ea5_1  3650 non-null   float64
 35  tcolc_e1_1  3650 non-null   float64
 36  tcolc_e2_1  3650 non-null   float64
 37  tcolc_e3_1  3650 non-null   float64
 38  tcolc_e4_1  3650 non-null   float64
 39  tcolc_e5_1  3650 non-null   float64
 40  tmax_2m1_1  3650 non-null   float64
 41  tmax_2m2_1  3650 non-null   float64
 42  tmax_2m3_1  3650 non-null   float64
 43  tmax_2m4_1  3650 non-null   float64
 44  tmax_2m5_1  3650 non-null   float64
 45  tmin_2m1_1  3650 non-null   float64
 46  tmin_2m2_1  3650 non-null   float64
 47  tmin_2m3_1  3650 non-null   float64
 48  tmin_2m4_1  3650 non-null   float64
 49  tmin_2m5_1  3650 non-null   float64
 50  tmp_2m_1_1  3650 non-null   float64
 51  tmp_2m_2_1  3650 non-null   float64
 52  tmp_2m_3_1  3650 non-null   float64
 53  tmp_2m_4_1  3650 non-null   float64
 54  tmp_2m_5_1  3650 non-null   float64
 55  tmp_sfc1_1  3650 non-null   float64
 56  tmp_sfc2_1  3650 non-null   float64
 57  tmp_sfc3_1  3650 non-null   float64
 58  tmp_sfc4_1  3650 non-null   float64
 59  tmp_sfc5_1  3650 non-null   float64
 60  ulwrf_s1_1  3650 non-null   float64
 61  ulwrf_s2_1  3650 non-null   float64
 62  ulwrf_s3_1  3650 non-null   float64
 63  ulwrf_s4_1  3650 non-null   float64
 64  ulwrf_s5_1  3650 non-null   float64
 65  ulwrf_t1_1  3650 non-null   float64
 66  ulwrf_t2_1  3650 non-null   float64
 67  ulwrf_t3_1  3650 non-null   float64
 68  ulwrf_t4_1  3650 non-null   float64
 69  ulwrf_t5_1  3650 non-null   float64
 70  uswrf_s1_1  3650 non-null   float64
 71  uswrf_s2_1  3650 non-null   float64
 72  uswrf_s3_1  3650 non-null   float64
 73  uswrf_s4_1  3650 non-null   float64
 74  uswrf_s5_1  3650 non-null   float64
 75  salida      3650 non-null   int64  
dtypes: float64(75), int64(1)
memory usage: 2.1+ MB

3.2. Missing values¶

Fist, we check the number the total number of missing values in the dataset in order to know if we have to clean the dataset or not.

In [77]:
train_df.isna().sum()
Out[77]:
apcp_sf1_1    0
apcp_sf2_1    0
apcp_sf3_1    0
apcp_sf4_1    0
apcp_sf5_1    0
             ..
uswrf_s2_1    0
uswrf_s3_1    0
uswrf_s4_1    0
uswrf_s5_1    0
salida        0
Length: 76, dtype: int64

As we can oberve, there are no missing values in the dataset, but theres still the possibility of having missing values measured as 0's, so we will check if all those zeros make sense in the context of the dataset or not.

In [78]:
# In the plot, we can see that there are a lot of 0 values in the dataset
train_df.plot(legend=False, figsize=(15, 5))
Out[78]:
<Axes: >
In [79]:
result = train_df.eq(0.0).sum() / len(train_df) * 100

# Select those columns with more than 30% of zeros
result = result[result > 30.0]
result = result.sort_values(ascending=False)
result
Out[79]:
dswrf_s1_1    91.808219
uswrf_s1_1    90.767123
apcp_sf4_1    63.041096
apcp_sf5_1    61.041096
apcp_sf1_1    60.821918
apcp_sf2_1    59.890411
apcp_sf3_1    56.739726
tcdc_ea3_1    37.917808
tcdc_ea1_1    37.808219
tcdc_ea2_1    37.424658
tcdc_ea5_1    36.301370
tcdc_ea4_1    35.726027
dtype: float64

Observations¶

As output of the previous cell, we can see that there exist a lot of zeros in the dataset, let's analize if those zeros make sense or not.

The variables with most ammount of zeros (>30%) are:

  • dswrf_s1_1: Downward short-wave radiative flux average at the surface, at 12:00 UTC, normal to have a lot of zeros as it is not sunny at 12:00
  • uswrf_s1_1: Upward short-wave radiation at the surface, at 12:00 UTC, normal to have a lot of zeros as it is not sunny at 12:00
  • apcp_s: 3-Hour accumulated precipitation at the surface, as it is not raining every day, it is normal to have a lot of zeros
  • tcdc_ea: Total cloud cover over the entire depth of the atmosphere, as it is not cloudy every day, it is normal to have a lot of zeros

First, lets start by assigning the zeros to NaNs. By doing this we can visualize the varibles that take more values other than zero.

In [80]:
disp_df_nan = train_df.replace(0.0, np.nan)
In [81]:
""" Plotting missing values """
# Sustitute 0.0 values with NaN and plot the name of the columns with missing values
# ? msno.bar is a simple visualization of nullity by column
msno.bar(disp_df_nan, labels=True, fontsize=7, figsize=(15, 7))

# Exporting image as png to ../data/img folder
plt.savefig("../data/img/missing_values_bar.png")
In [82]:
""" Plotting the missing values in a matrix """

# ? The msno.matrix nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion.
msno.matrix(disp_df_nan)

# Exporting image as png to ../data/img folder
plt.savefig("../data/img/missing_values_matrix.png")
In [83]:
""" Plotting the missing values in a heatmap """
# As in a hetmap not every value is shown, we must delimit the values to the ones with more than 30% of missing values
result = disp_df.eq(0.0).sum() / len(disp_df) * 100
result = result[result > 30.0]  # Select those columns with more than 30% of zeros
result = result.sort_values(ascending=False)
result = result.index.tolist()  # Convert to list
result

# ? The missingno correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another
msno.heatmap(disp_df_nan[result], fontsize=7, figsize=(15, 7))

# Exporting image as png to ../data/img folder
plt.savefig("../data/img/missing_values_heatmap.png")
In [84]:
""" Plotting the dendrogram """

# ? The dendrogram allows you to more fully correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heatmap:
msno.dendrogram(disp_df_nan, orientation="top", fontsize=7, figsize=(15, 7))

# Exporting image as png to ../data/img folder
plt.savefig("../data/img/missing_values_dendrogram.png")

Conclusions¶

In this section, we have observe that there are no attibutes with 'Null' nor 'NaN' nor 'None' values. This indicated that at a first glance, the data is clean, at least of those datatypes.

In second place, we have observed that the attributes that we suspected could have an important number of missing values (represented by 0 instead of the previously mentioned), had instead valuable information, as we have proved along this section.
Since the data is clean and we have concluded there are no missing values, we do not need to complete them using a model or other methods, so we can move on to the next step, observing the outliers.

3.3. Outliers¶

Detecting outliers in a dataset before training a model is crucial because outliers can significantly affect the performance and accuracy of the model. Outliers are data points that deviate significantly from the rest of the dataset and can cause the model to learn incorrect patterns and relationships. When outliers are present in the data, they can also increase the variance of the model, which can result in overfitting. Overfitting occurs when the model fits too closely to the training data, leading to poor generalization to new data. Therefore, it is important to detect and handle outliers properly to ensure the model's accuracy and robustness.

In [85]:
list_of_attributes = train_df.columns.values.tolist()
#print(list_of_attributes)
In [86]:
# Boxplot with all attributes in the dataset
# sns.boxplot(data=train_df, orient="h")
# plt.show()
In [87]:
train_df.describe()
Out[87]:
apcp_sf1_1 apcp_sf2_1 apcp_sf3_1 apcp_sf4_1 apcp_sf5_1 dlwrf_s1_1 dlwrf_s2_1 dlwrf_s3_1 dlwrf_s4_1 dlwrf_s5_1 dswrf_s1_1 dswrf_s2_1 dswrf_s3_1 dswrf_s4_1 dswrf_s5_1 pres_ms1_1 pres_ms2_1 pres_ms3_1 pres_ms4_1 pres_ms5_1 pwat_ea1_1 pwat_ea2_1 pwat_ea3_1 pwat_ea4_1 pwat_ea5_1 spfh_2m1_1 spfh_2m2_1 spfh_2m3_1 spfh_2m4_1 spfh_2m5_1 tcdc_ea1_1 tcdc_ea2_1 tcdc_ea3_1 tcdc_ea4_1 tcdc_ea5_1 tcolc_e1_1 tcolc_e2_1 tcolc_e3_1 tcolc_e4_1 tcolc_e5_1 tmax_2m1_1 tmax_2m2_1 tmax_2m3_1 tmax_2m4_1 tmax_2m5_1 tmin_2m1_1 tmin_2m2_1 tmin_2m3_1 tmin_2m4_1 tmin_2m5_1 tmp_2m_1_1 tmp_2m_2_1 tmp_2m_3_1 tmp_2m_4_1 tmp_2m_5_1 tmp_sfc1_1 tmp_sfc2_1 tmp_sfc3_1 tmp_sfc4_1 tmp_sfc5_1 ulwrf_s1_1 ulwrf_s2_1 ulwrf_s3_1 ulwrf_s4_1 ulwrf_s5_1 ulwrf_t1_1 ulwrf_t2_1 ulwrf_t3_1 ulwrf_t4_1 ulwrf_t5_1 uswrf_s1_1 uswrf_s2_1 uswrf_s3_1 uswrf_s4_1 uswrf_s5_1 salida
count 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3650.000000 3.650000e+03
mean 0.610222 0.251049 0.479367 0.279969 0.525625 316.590458 316.996492 324.225574 343.169304 342.582550 0.074371 163.928966 376.718929 686.534869 508.429988 101718.580471 101774.517076 101743.013770 101538.253073 101499.397514 21.394485 21.536129 22.127195 22.595594 22.384870 0.007844 0.008848 0.009356 0.009473 0.009918 0.069240 0.067845 0.064862 0.065706 0.062366 0.069539 0.068172 0.065166 0.066036 0.062748 286.950030 288.292227 292.803749 294.483694 294.542492 284.595935 284.638684 284.617400 292.733513 291.084714 284.846286 288.227387 292.740802 294.299550 291.301035 284.094056 289.230769 295.533258 295.904819 290.366407 375.991521 381.989673 400.742449 439.104661 431.318749 247.736467 247.626828 251.950057 262.207928 261.074238 0.078107 38.716712 76.394795 127.098207 99.476613 1.638200e+07
std 2.245850 0.994112 1.756408 1.120933 1.931408 56.119896 58.129352 58.941747 61.150202 61.027007 0.305126 112.645372 159.486316 227.642854 193.753483 725.206610 731.500969 720.701217 699.477989 715.361146 12.256253 12.358856 12.583364 12.633154 12.401121 0.004398 0.005039 0.005175 0.005097 0.005456 0.167104 0.169653 0.171287 0.172516 0.166113 0.166989 0.169522 0.171172 0.172385 0.165958 8.925065 9.743169 9.898253 9.789117 9.776615 8.735982 8.862301 8.866503 9.950300 10.099684 8.722593 9.795209 9.944761 9.795537 10.083859 8.861650 9.756852 9.148308 9.317363 10.462108 46.586515 49.914820 50.766618 53.159310 54.417631 36.270918 36.289003 35.798277 38.698726 38.427066 0.258752 26.010130 30.743175 40.765618 35.505727 8.059674e+06
min 0.000000 0.000000 0.000000 0.000000 0.000000 158.971770 160.032903 165.524543 183.671312 186.342961 0.000000 0.000000 20.000000 30.000000 20.000000 99316.970881 99315.887074 99327.755682 99040.100852 98830.153409 1.100000 1.314819 1.107352 1.142803 1.201246 0.000462 0.000485 0.000451 0.000478 0.000468 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 254.589220 254.937418 258.549777 260.800365 260.863475 251.941358 249.576132 249.576714 258.698331 258.171345 251.942065 254.844406 258.552646 260.795430 258.170049 250.100794 256.360800 263.634377 264.533564 256.520408 229.296161 223.985486 246.314349 278.576630 271.707606 104.671267 113.559602 118.679132 119.393449 121.951425 0.000000 0.000000 3.181818 4.363636 2.545455 5.100000e+05
25% 0.000000 0.000000 0.000000 0.000000 0.000000 270.043573 267.583016 275.008281 292.786299 291.777096 0.000000 52.727273 240.000000 525.454545 344.477273 101266.472124 101311.399680 101283.033381 101102.175426 101049.033203 10.879000 10.718024 11.122964 11.558385 11.559638 0.003991 0.004229 0.004617 0.004736 0.004754 0.000000 0.000000 0.000000 0.000000 0.000000 0.000618 0.000564 0.000545 0.000673 0.000727 280.127903 280.413352 284.916120 286.996939 287.043590 277.844468 277.666665 277.653641 284.828555 283.228638 278.075898 280.285865 284.764714 286.841753 283.495367 277.025516 281.298508 288.689326 288.982901 281.922856 338.180517 340.208757 358.706337 398.061333 388.214025 230.536257 230.759227 234.398558 246.594118 244.431134 0.000000 14.000000 53.818182 108.818182 74.909091 1.061385e+07
50% 0.000000 0.000000 0.000000 0.000000 0.000000 319.801794 321.400251 328.456741 345.402277 345.107513 0.000000 150.000000 384.318182 730.000000 525.636364 101645.975852 101704.351207 101674.750000 101472.419389 101425.127131 19.191209 19.163636 19.650000 20.290909 20.194252 0.007246 0.008248 0.008909 0.009156 0.009518 0.004545 0.004545 0.003636 0.003636 0.002727 0.005118 0.004855 0.004145 0.004345 0.003673 287.597683 289.039583 293.757317 295.529044 295.598358 285.070623 285.192374 285.106049 293.606805 291.806785 285.381022 288.990954 293.704959 295.310534 292.050580 284.581711 290.050444 296.204518 296.669453 291.126846 376.267101 382.791372 401.524648 440.987373 433.520339 253.350231 253.394166 257.342928 270.790095 269.287814 0.000000 35.500000 79.636364 136.636364 105.454545 1.638195e+07
75% 0.114545 0.051818 0.121591 0.033636 0.090000 367.134144 370.342597 378.683015 399.545104 398.891589 0.000000 264.454545 524.636364 893.636364 693.681818 102131.380504 102188.719283 102148.666726 101940.005327 101919.873402 31.188882 31.471632 32.439831 33.103788 32.459132 0.011612 0.013523 0.014275 0.014169 0.015062 0.055455 0.056364 0.042727 0.043636 0.038182 0.056136 0.056918 0.042900 0.043523 0.038843 294.329548 296.940075 301.377123 302.732298 302.753016 292.290495 292.587547 292.576612 301.351051 299.909759 292.551771 296.945783 301.346064 302.621007 300.068969 292.110791 297.882618 303.161016 303.555412 299.639197 416.508387 427.792698 445.458675 482.899051 476.327880 274.309069 274.750930 278.752231 289.945510 289.588822 0.000000 62.000000 103.068182 155.454545 129.727273 2.329185e+07
max 34.428182 16.846364 28.399091 26.381818 36.875455 426.173970 427.486894 429.693146 455.566337 453.910406 3.000000 381.818182 642.181818 990.000000 791.090909 104688.396307 104856.285511 104693.185369 104244.932528 104249.968040 60.327273 58.876881 59.915362 59.309182 60.529133 0.018809 0.019533 0.020985 0.021932 0.023318 1.920909 2.370000 2.449091 2.146364 1.957273 1.920136 2.369282 2.450482 2.146409 1.956655 304.480122 304.792880 311.277519 312.660564 312.668726 300.350930 299.724509 299.735546 310.815957 308.761763 300.344230 304.773410 311.272270 312.595520 308.827304 299.869093 306.834309 315.964081 313.965757 308.270147 470.753102 469.429213 504.584351 555.704024 542.529280 318.245345 311.991660 315.569164 328.920274 327.253141 1.000000 92.272727 192.636364 450.636364 313.909091 3.122700e+07
In [88]:
train_df['apcp_sf1_1'].value_counts()
Out[88]:
0.000000    2220
0.000909      54
0.001818      24
0.003636      19
0.002727      19
            ... 
2.356364       1
0.920000       1
0.048182       1
0.211818       1
1.363636       1
Name: apcp_sf1_1, Length: 1170, dtype: int64

Here, by plotting the boxplots and making the outliers (fliers) visible, we are able to see some outliers in the dataset.
Take into account that the outliers are represented by the points outside the boxplot and they can be potentially wrong values or just values that are not usual in the dataset (ruido).

3.3.1. Histogram to identify outliers¶

In [89]:
""" Histogram showing the distribtuion of train_df to show the outliers """
plt.hist(train_df)
plt.show()

Here, as in the boxplot, we can see the outliers in the dataset as well as observing the right skewness of the data as we will later see more clearly in the distribution plots.

3.3.2. Boxplot to identify outliers¶

With the objective of noticing the outliers on each attribute, we create a box-plot of each of the attributes

In [90]:
""" Plotting the boxplot for each attribute and getting the outliers of each attribute """
total_outliers = []
# * We iterate over the list of attributes
for attribute in list_of_attributes:
    # * sns.regplot(x=train_df[attribute], y=train_df['total'], fit_reg=False)
    sns.boxplot(data=train_df[attribute], x=train_df[attribute], orient="h")
    # * Use the command below to show each plot (small size for visualization sake)
    # sns.set(rc={'figure.figsize':(1,.5)})
    # plt.show()
    # * All the images are saved in the folder ../data/img/box-plot
    plt.savefig(f"../data/img/box-plot/{str(attribute)}.png")

    # We obtain the a list of outliers for each attribute
    list_of_outliers = train_df[attribute][train_df[attribute] > train_df[attribute].quantile(0.75) + 1.5*(train_df[attribute].quantile(0.75) - train_df[attribute].quantile(0.25))].tolist()
    outliers = [f'{attribute} outliers'] + [len(list_of_outliers)] + [list_of_outliers]
    # * In order to print the total number of outliers for each attribute
    # print(f'{attribute} has {len(list_of_outliers)} outliers')
    # ! Data structure: [attribute, number of outliers, list of outliers]
    # print(outliers)
    total_outliers.append(outliers)

# print the first 2 elements of each element in the list -> [[atb, num],[atb, num],...]
num_atb_outliers = 0
for i in total_outliers:
    if i[1] != 0:
        num_atb_outliers += 1
        print(f"{i[0:2]}...")
        
# Number of outliers != 0 for each attribute
print(f"Total number of atributes with outliers: {num_atb_outliers} / {len(total_outliers)-1}")
['apcp_sf1_1 outliers', 693]...
['apcp_sf2_1 outliers', 674]...
['apcp_sf3_1 outliers', 677]...
['apcp_sf4_1 outliers', 761]...
['apcp_sf5_1 outliers', 709]...
['dswrf_s1_1 outliers', 299]...
['pres_ms1_1 outliers', 56]...
['pres_ms2_1 outliers', 55]...
['pres_ms3_1 outliers', 64]...
['pres_ms4_1 outliers', 68]...
['pres_ms5_1 outliers', 58]...
['tcdc_ea1_1 outliers', 514]...
['tcdc_ea2_1 outliers', 525]...
['tcdc_ea3_1 outliers', 575]...
['tcdc_ea4_1 outliers', 549]...
['tcdc_ea5_1 outliers', 559]...
['tcolc_e1_1 outliers', 513]...
['tcolc_e2_1 outliers', 523]...
['tcolc_e3_1 outliers', 575]...
['tcolc_e4_1 outliers', 555]...
['tcolc_e5_1 outliers', 560]...
['uswrf_s1_1 outliers', 337]...
['uswrf_s3_1 outliers', 3]...
['uswrf_s4_1 outliers', 31]...
['uswrf_s5_1 outliers', 9]...
Total number of atributes with outliers: 25 / 75

We managed to create a list containing the name of the atribute, the number of outliers and the value of the outliers for each attribute, calculated by applying the IQR method.
This is relevant as we managed to create a 'total_outliers' variable that contains the list data structures of all the different outliers from all the attributes, so that it can be easily accessed in a future to remove the outliers from the dataset if needed for testing purposes.

As suspected, we can see that there are a lot of outliers in the dataset, therefore it is plausible that some of them are noise, thus possibly being removed in a future model in order to improve it (either by hand or by selection in the preprocess pipeline).
Now, we need to analyze if they are the result of bad measurements or if they are significant data for the analysis.

Additionaly, as we will see later, this amount of outliers indicate us that probably a Robust Scaler will be more appropriate than using a Standard Scaler, as the Robust Scaler is more robust to outliers than the Standard Scaler, thus being more appropriate for this dataset model.

3.3.3. Skewness and Kurtosis to identify outliers¶

Skewness and kurtosis are commonly used to measure the shape of a distribution. Skewness measures the degree of asymmetry in the distribution, while kurtosis measures the degree of flatness in the distribution compared to a normal distribution. We will look for observations that are far from the central tendency of the distribution and may indicate the presence of extreme values or data points that do not fit the pattern of the majority of the data (which as expected, happens to be the case in this dataset).

In [91]:
""" Skewness """
# ? skewness: measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
train_df.skew().sort_values(ascending=False)
Out[91]:
apcp_sf4_1    9.297678
apcp_sf2_1    7.610005
apcp_sf5_1    7.244491
apcp_sf3_1    7.241727
apcp_sf1_1    6.783553
                ...   
ulwrf_t1_1   -0.964701
ulwrf_t3_1   -0.989917
ulwrf_t2_1   -1.001763
ulwrf_t5_1   -1.071147
ulwrf_t4_1   -1.196425
Length: 76, dtype: float64
In [92]:
""" Kurtosis """
# ? kurtosis: measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.
train_df.kurt().sort_values(ascending=False)
Out[92]:
apcp_sf4_1    138.601323
apcp_sf2_1     79.535762
apcp_sf5_1     78.321580
apcp_sf3_1     72.498316
apcp_sf1_1     61.204708
                 ...    
uswrf_s2_1     -1.306893
spfh_2m2_1     -1.320499
spfh_2m5_1     -1.321073
dswrf_s2_1     -1.323864
spfh_2m3_1     -1.329847
Length: 76, dtype: float64
In [93]:
y = train_df["apcp_sf4_1"]
plt.figure(1)
plt.title("Normal")
sns.distplot(y, kde=True, fit=st.norm)
plt.figure(2)
plt.title("Log Normal")
sns.distplot(y, kde=True, fit=st.lognorm)
/tmp/ipykernel_7049/2978756091.py:4: UserWarning: 

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

  sns.distplot(y, kde=True, fit=st.norm)
/tmp/ipykernel_7049/2978756091.py:7: UserWarning: 

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

  sns.distplot(y, kde=True, fit=st.lognorm)
Out[93]:
<Axes: title={'center': 'Log Normal'}, xlabel='apcp_sf4_1', ylabel='Density'>
In [94]:
sns.distplot(train_df.skew(), color="blue", axlabel="Skewness")
/tmp/ipykernel_7049/388743980.py:1: UserWarning: 

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

  sns.distplot(train_df.skew(), color="blue", axlabel="Skewness")
Out[94]:
<Axes: xlabel='Skewness', ylabel='Density'>
In [95]:
plt.figure(figsize=(12, 8))
sns.distplot(
    train_df.kurt(), color="r", axlabel="Kurtosis", norm_hist=False, kde=True, rug=False
)
# plt.hist(train.kurt(),orientation = 'vertical',histtype = 'bar',label ='Kurtosis', color ='blue')
plt.show()
/tmp/ipykernel_7049/4054214216.py:2: UserWarning: 

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

  sns.distplot(

3.4. Correlation¶

In this section we are getting information about the correlation of the variables between them. This information is valuable in order to make good decisions when deleting redundant attributes. Also note we are getting information about the correlation between each attribute and the solution variable. This allows us to know the most relevant attributes, making the best decisions when creating the different models.

In [96]:
correlation = train_df.corr()
correlation = abs(correlation)
print(correlation.shape)  # 76 x 76 matrix of correlation values
(76, 76)

Getting the correlation matrix formatted into our own data structure¶

This is done for the sake of simplicity and to be able to visualize the correlation matrix in a more intuitive way.

In [97]:
correlation_list = []

for column in train_df.columns:
    correlation.loc[:, column] = abs(
        correlation.iloc[:, train_df.columns.get_loc(column)]
    )
    mask = correlation.loc[:, column] > 0.95
    # print(correlation[column][mask].sort_values(ascending = False))

    # Translate the comment below to English:
    # we add the correlation values to a list of lists, which contains the names of the correlated columns and their correlation index

    # The first segment adds the name of the column we are analyzing
    # The second segment adds the names of the columns correlated (except the column we are analyzing) > 0.95
    # The third segment adds the correlation index of the columns correlated (except the column we are analyzing) > 0.95
    # Second and third segment are added to the first segment as a list of lists

    # First we need to create a dictionary with the column names and their correlation values (except the column we are analyzing)
    dict = {
        key: value
        for key, value in correlation.loc[column, mask]
        .sort_values(ascending=False)
        .iloc[1:]
        .to_dict()
        .items()
    }
    # print (dict)

    # Then we create a list of lists with the column names and their correlation values from the dictionary created above
    corr_list = [[key] + [value] for key, value in dict.items()]
    # Finally we add the name of the column we are analyzing to the list of lists created above as the first element of the list (str)
    corr_list.insert(0, ["Columna: " + column])

    # ! Data structure: [[columna, [columna correlada 1, indice de correlacion], [columna correlada 2, indice de correlacion], ...], ...]
    print(corr_list)

    correlation_list += [corr_list]
print(correlation_list)
[['Columna: apcp_sf1_1']]
[['Columna: apcp_sf2_1']]
[['Columna: apcp_sf3_1']]
[['Columna: apcp_sf4_1']]
[['Columna: apcp_sf5_1']]
[['Columna: dlwrf_s1_1'], ['dlwrf_s2_1', 0.9650067922254768], ['dlwrf_s3_1', 0.9547817730760655]]
[['Columna: dlwrf_s2_1'], ['dlwrf_s3_1', 0.993701215706055], ['dlwrf_s1_1', 0.9650067922254768]]
[['Columna: dlwrf_s3_1'], ['dlwrf_s2_1', 0.993701215706055], ['dlwrf_s4_1', 0.9659874690575408], ['dlwrf_s5_1', 0.9552712673845433], ['dlwrf_s1_1', 0.9547817730760655]]
[['Columna: dlwrf_s4_1'], ['dlwrf_s5_1', 0.9969222914149775], ['dlwrf_s3_1', 0.9659874690575408]]
[['Columna: dlwrf_s5_1'], ['dlwrf_s4_1', 0.9969222914149775], ['dlwrf_s3_1', 0.9552712673845433]]
[['Columna: dswrf_s1_1']]
[['Columna: dswrf_s2_1'], ['uswrf_s2_1', 0.9911709851006711], ['dswrf_s3_1', 0.9503896354343679]]
[['Columna: dswrf_s3_1'], ['uswrf_s2_1', 0.9591814530708258], ['dswrf_s2_1', 0.9503896354343679]]
[['Columna: dswrf_s4_1'], ['dswrf_s5_1', 0.982758557897581]]
[['Columna: dswrf_s5_1'], ['dswrf_s4_1', 0.982758557897581]]
[['Columna: pres_ms1_1'], ['pres_ms2_1', 0.9879236602955379], ['pres_ms3_1', 0.956852960202746]]
[['Columna: pres_ms2_1'], ['pres_ms1_1', 0.9879236602955379], ['pres_ms3_1', 0.9869377705171734], ['pres_ms4_1', 0.9536176398645005]]
[['Columna: pres_ms3_1'], ['pres_ms2_1', 0.9869377705171734], ['pres_ms4_1', 0.9866602703072012], ['pres_ms1_1', 0.956852960202746], ['pres_ms5_1', 0.9538147697170144]]
[['Columna: pres_ms4_1'], ['pres_ms3_1', 0.9866602703072012], ['pres_ms5_1', 0.9851755074525863], ['pres_ms2_1', 0.9536176398645005]]
[['Columna: pres_ms5_1'], ['pres_ms4_1', 0.9851755074525863], ['pres_ms3_1', 0.9538147697170144]]
[['Columna: pwat_ea1_1'], ['pwat_ea2_1', 0.9859484994851248], ['pwat_ea3_1', 0.9577107162594556]]
[['Columna: pwat_ea2_1'], ['pwat_ea3_1', 0.9874259658433963], ['pwat_ea1_1', 0.9859484994851248], ['pwat_ea4_1', 0.9618712300670131]]
[['Columna: pwat_ea3_1'], ['pwat_ea4_1', 0.9880603787665849], ['pwat_ea2_1', 0.9874259658433963], ['pwat_ea5_1', 0.9616424908340101], ['pwat_ea1_1', 0.9577107162594556]]
[['Columna: pwat_ea4_1'], ['pwat_ea3_1', 0.9880603787665849], ['pwat_ea5_1', 0.986763801908917], ['pwat_ea2_1', 0.9618712300670131]]
[['Columna: pwat_ea5_1'], ['pwat_ea4_1', 0.986763801908917], ['pwat_ea3_1', 0.9616424908340101]]
[['Columna: spfh_2m1_1'], ['spfh_2m2_1', 0.9742691195680059]]
[['Columna: spfh_2m2_1'], ['spfh_2m3_1', 0.9846069576918387], ['spfh_2m1_1', 0.9742691195680059], ['spfh_2m4_1', 0.9600698332225309]]
[['Columna: spfh_2m3_1'], ['spfh_2m4_1', 0.9891201306737782], ['spfh_2m2_1', 0.9846069576918387], ['spfh_2m5_1', 0.9771699520274281]]
[['Columna: spfh_2m4_1'], ['spfh_2m5_1', 0.9904262248914517], ['spfh_2m3_1', 0.9891201306737782], ['spfh_2m2_1', 0.9600698332225309]]
[['Columna: spfh_2m5_1'], ['spfh_2m4_1', 0.9904262248914517], ['spfh_2m3_1', 0.9771699520274281]]
[['Columna: tcdc_ea1_1'], ['tcolc_e1_1', 0.9999826963362115]]
[['Columna: tcdc_ea2_1'], ['tcolc_e2_1', 0.9999837132775715]]
[['Columna: tcdc_ea3_1'], ['tcolc_e3_1', 0.9999845616560729]]
[['Columna: tcdc_ea4_1'], ['tcolc_e4_1', 0.999984785893167]]
[['Columna: tcdc_ea5_1'], ['tcolc_e5_1', 0.9999746391911669]]
[['Columna: tcolc_e1_1'], ['tcdc_ea1_1', 0.9999826963362115]]
[['Columna: tcolc_e2_1'], ['tcdc_ea2_1', 0.9999837132775715]]
[['Columna: tcolc_e3_1'], ['tcdc_ea3_1', 0.9999845616560729]]
[['Columna: tcolc_e4_1'], ['tcdc_ea4_1', 0.999984785893167]]
[['Columna: tcolc_e5_1'], ['tcdc_ea5_1', 0.9999746391911669]]
[['Columna: tmax_2m1_1'], ['ulwrf_s1_1', 0.9925465923917536], ['tmin_2m1_1', 0.9864627566914622], ['tmp_2m_1_1', 0.9844648786308661], ['tmp_sfc1_1', 0.9826171735169642], ['tmin_2m2_1', 0.9794577781348043], ['tmin_2m3_1', 0.97879484323375], ['ulwrf_s2_1', 0.9707221719806952], ['tmax_2m2_1', 0.9637997824764319], ['tmp_2m_2_1', 0.9602996922677912], ['ulwrf_s3_1', 0.9578768560091268], ['tmp_sfc2_1', 0.9528200105400535]]
[['Columna: tmax_2m2_1'], ['tmp_2m_2_1', 0.9993112900738094], ['tmp_sfc2_1', 0.9963595768447152], ['ulwrf_s3_1', 0.9943559569788851], ['ulwrf_s2_1', 0.9917542607726626], ['tmax_2m3_1', 0.9863610206089146], ['tmp_2m_3_1', 0.9829472636912582], ['tmin_2m2_1', 0.9819461458841241], ['tmin_2m3_1', 0.9819334574863472], ['tmin_2m4_1', 0.9790285687297058], ['tmp_2m_1_1', 0.9771397511742945], ['tmin_2m1_1', 0.9735093237780129], ['tmin_2m5_1', 0.9713195080332866], ['tmax_2m4_1', 0.9698857690834339], ['tmax_2m5_1', 0.9697541323437304], ['tmp_sfc1_1', 0.9697293778574791], ['tmp_sfc5_1', 0.9692885564165558], ['tmp_2m_5_1', 0.9675902215544175], ['tmp_2m_4_1', 0.9645870257681579], ['tmax_2m1_1', 0.9637997824764319], ['ulwrf_s1_1', 0.9606784450210313], ['ulwrf_s5_1', 0.9582268665523904], ['tmp_sfc3_1', 0.9540154794130064]]
[['Columna: tmax_2m3_1'], ['tmp_2m_3_1', 0.9989979986044517], ['tmin_2m4_1', 0.9973034556239585], ['tmax_2m4_1', 0.9937711137699543], ['tmax_2m5_1', 0.9931925682521422], ['tmp_2m_4_1', 0.9902875444682745], ['ulwrf_s3_1', 0.9882838798756105], ['tmp_2m_2_1', 0.9880399268817291], ['tmp_sfc2_1', 0.9874628996347063], ['tmax_2m2_1', 0.9863610206089146], ['tmin_2m5_1', 0.9849440628695366], ['tmp_sfc3_1', 0.9836268315982332], ['ulwrf_s5_1', 0.9831073628529786], ['tmp_2m_5_1', 0.9800734262132967], ['ulwrf_s4_1', 0.9784038012441758], ['tmp_sfc4_1', 0.977121003001589], ['tmp_sfc5_1', 0.9751782371847001], ['ulwrf_s2_1', 0.966905126635995], ['tmin_2m3_1', 0.9565532246066281], ['tmin_2m2_1', 0.9554925734198499]]
[['Columna: tmax_2m4_1'], ['tmax_2m5_1', 0.999855391919824], ['tmp_2m_4_1', 0.9989670084606509], ['tmin_2m4_1', 0.9965267117336937], ['tmp_2m_3_1', 0.9954824569443925], ['tmax_2m3_1', 0.9937711137699543], ['ulwrf_s5_1', 0.9915793553887863], ['tmp_sfc4_1', 0.9903005164322048], ['tmin_2m5_1', 0.9888950622930037], ['ulwrf_s4_1', 0.987077087459209], ['tmp_2m_5_1', 0.9862698111992504], ['tmp_sfc3_1', 0.9857825429301692], ['tmp_sfc5_1', 0.9780961580272722], ['ulwrf_s3_1', 0.975826585247475], ['tmp_sfc2_1', 0.9734162968894661], ['tmp_2m_2_1', 0.9730783300205592], ['tmax_2m2_1', 0.9698857690834339]]
[['Columna: tmax_2m5_1'], ['tmax_2m4_1', 0.999855391919824], ['tmp_2m_4_1', 0.9988753498975144], ['tmin_2m4_1', 0.9959215026917639], ['tmp_2m_3_1', 0.9948623637802105], ['tmax_2m3_1', 0.9931925682521422], ['ulwrf_s5_1', 0.99151293872598], ['tmp_sfc4_1', 0.9902303273948138], ['tmin_2m5_1', 0.9891405894645761], ['tmp_2m_5_1', 0.9872536923433799], ['ulwrf_s4_1', 0.9863521279974694], ['tmp_sfc3_1', 0.9848432369308402], ['tmp_sfc5_1', 0.9791492889290042], ['ulwrf_s3_1', 0.975536953838263], ['tmp_sfc2_1', 0.9732760733014082], ['tmp_2m_2_1', 0.9729567511523166], ['tmax_2m2_1', 0.9697541323437304]]
[['Columna: tmin_2m1_1'], ['tmp_2m_1_1', 0.9981185076252812], ['tmp_sfc1_1', 0.9963142972792025], ['tmin_2m2_1', 0.9950949108277914], ['tmin_2m3_1', 0.9943408586923532], ['ulwrf_s1_1', 0.9926165091998508], ['tmax_2m1_1', 0.9864627566914622], ['ulwrf_s2_1', 0.9814254257054185], ['tmax_2m2_1', 0.9735093237780129], ['tmp_2m_2_1', 0.9708467193246784], ['ulwrf_s3_1', 0.9654648681213764], ['tmp_sfc2_1', 0.9605896048258724]]
[['Columna: tmin_2m2_1'], ['tmin_2m3_1', 0.9996572623026859], ['tmp_2m_1_1', 0.9981819106379802], ['tmp_sfc1_1', 0.9960749292152973], ['tmin_2m1_1', 0.9950949108277914], ['ulwrf_s2_1', 0.9899721168841563], ['ulwrf_s1_1', 0.9850580289147287], ['tmax_2m2_1', 0.9819461458841241], ['tmp_2m_2_1', 0.9808603480091717], ['tmax_2m1_1', 0.9794577781348043], ['ulwrf_s3_1', 0.9743529570168544], ['tmp_sfc2_1', 0.9709866438853475], ['tmax_2m3_1', 0.9554925734198499], ['tmp_2m_3_1', 0.9516543862813964]]
[['Columna: tmin_2m3_1'], ['tmin_2m2_1', 0.9996572623026859], ['tmp_2m_1_1', 0.9973859703817012], ['tmp_sfc1_1', 0.9951903927420194], ['tmin_2m1_1', 0.9943408586923532], ['ulwrf_s2_1', 0.9897318251363909], ['ulwrf_s1_1', 0.9843176374301973], ['tmax_2m2_1', 0.9819334574863472], ['tmp_2m_2_1', 0.9813049765366397], ['tmax_2m1_1', 0.97879484323375], ['ulwrf_s3_1', 0.9751256747221002], ['tmp_sfc2_1', 0.9715874768684567], ['tmax_2m3_1', 0.9565532246066281], ['tmp_2m_3_1', 0.9537926818714694]]
[['Columna: tmin_2m4_1'], ['tmp_2m_3_1', 0.9989480053798309], ['tmax_2m3_1', 0.9973034556239585], ['tmax_2m4_1', 0.9965267117336937], ['tmax_2m5_1', 0.9959215026917639], ['tmp_2m_4_1', 0.9955546180266533], ['tmin_2m5_1', 0.9884708317878447], ['ulwrf_s5_1', 0.9881522932211056], ['tmp_sfc3_1', 0.985897294709102], ['tmp_sfc4_1', 0.9839686317112281], ['ulwrf_s4_1', 0.9837254094808376], ['tmp_2m_5_1', 0.9836800034039297], ['ulwrf_s3_1', 0.9834305304179148], ['tmp_2m_2_1', 0.9819192741530762], ['tmp_sfc2_1', 0.9818609853360799], ['tmax_2m2_1', 0.9790285687297058], ['tmp_sfc5_1', 0.9773521584222015], ['ulwrf_s2_1', 0.9587402415390521]]
[['Columna: tmin_2m5_1'], ['tmp_2m_5_1', 0.9985248666465218], ['tmp_sfc5_1', 0.9960054258735697], ['tmp_2m_4_1', 0.9895327574093381], ['tmax_2m5_1', 0.9891405894645761], ['tmax_2m4_1', 0.9888950622930037], ['tmin_2m4_1', 0.9884708317878447], ['ulwrf_s5_1', 0.9873222874147335], ['tmp_2m_3_1', 0.9863630955598195], ['tmax_2m3_1', 0.9849440628695366], ['tmp_sfc4_1', 0.9799245981499726], ['ulwrf_s3_1', 0.9771636437810954], ['tmp_sfc2_1', 0.975307217691161], ['tmp_2m_2_1', 0.973997775172227], ['ulwrf_s4_1', 0.9737488476761204], ['tmax_2m2_1', 0.9713195080332866], ['tmp_sfc3_1', 0.9704219757785946], ['ulwrf_s2_1', 0.9584808725553905]]
[['Columna: tmp_2m_1_1'], ['tmin_2m2_1', 0.9981819106379802], ['tmin_2m1_1', 0.9981185076252812], ['tmp_sfc1_1', 0.9979448017549044], ['tmin_2m3_1', 0.9973859703817012], ['ulwrf_s1_1', 0.9897625110599417], ['ulwrf_s2_1', 0.9853437253651708], ['tmax_2m1_1', 0.9844648786308661], ['tmax_2m2_1', 0.9771397511742945], ['tmp_2m_2_1', 0.9744909003873826], ['ulwrf_s3_1', 0.968308266178652], ['tmp_sfc2_1', 0.9638593940360212]]
[['Columna: tmp_2m_2_1'], ['tmax_2m2_1', 0.9993112900738094], ['tmp_sfc2_1', 0.9971658992806404], ['ulwrf_s3_1', 0.994870585784455], ['ulwrf_s2_1', 0.990609269111517], ['tmax_2m3_1', 0.9880399268817291], ['tmp_2m_3_1', 0.9856921834337115], ['tmin_2m4_1', 0.9819192741530762], ['tmin_2m3_1', 0.9813049765366397], ['tmin_2m2_1', 0.9808603480091717], ['tmp_2m_1_1', 0.9744909003873826], ['tmin_2m5_1', 0.973997775172227], ['tmax_2m4_1', 0.9730783300205592], ['tmax_2m5_1', 0.9729567511523166], ['tmp_sfc5_1', 0.9714528776518785], ['tmin_2m1_1', 0.9708467193246784], ['tmp_2m_5_1', 0.9702697100229983], ['tmp_2m_4_1', 0.9679046544965666], ['tmp_sfc1_1', 0.9668200402557023], ['ulwrf_s5_1', 0.960980140611558], ['tmax_2m1_1', 0.9602996922677912], ['ulwrf_s1_1', 0.9573813147342563], ['tmp_sfc3_1', 0.9570765900532514]]
[['Columna: tmp_2m_3_1'], ['tmax_2m3_1', 0.9989979986044517], ['tmin_2m4_1', 0.9989480053798309], ['tmax_2m4_1', 0.9954824569443925], ['tmax_2m5_1', 0.9948623637802105], ['tmp_2m_4_1', 0.9925508354894933], ['ulwrf_s3_1', 0.986475159576972], ['tmin_2m5_1', 0.9863630955598195], ['tmp_2m_2_1', 0.9856921834337115], ['tmp_sfc3_1', 0.9856020769129077], ['tmp_sfc2_1', 0.9853721430018374], ['ulwrf_s5_1', 0.9850287211532148], ['tmax_2m2_1', 0.9829472636912582], ['tmp_2m_5_1', 0.9813875596975576], ['ulwrf_s4_1', 0.9807390701133252], ['tmp_sfc4_1', 0.9796642142509698], ['tmp_sfc5_1', 0.9758811197330793], ['ulwrf_s2_1', 0.9631310421561378], ['tmin_2m3_1', 0.9537926818714694], ['tmin_2m2_1', 0.9516543862813964]]
[['Columna: tmp_2m_4_1'], ['tmax_2m4_1', 0.9989670084606509], ['tmax_2m5_1', 0.9988753498975144], ['tmin_2m4_1', 0.9955546180266533], ['ulwrf_s5_1', 0.9926662202081515], ['tmp_2m_3_1', 0.9925508354894933], ['tmp_sfc4_1', 0.9924835965861123], ['tmax_2m3_1', 0.9902875444682745], ['tmin_2m5_1', 0.9895327574093381], ['ulwrf_s4_1', 0.9875947598774877], ['tmp_2m_5_1', 0.9871404965220815], ['tmp_sfc3_1', 0.9838690005793208], ['tmp_sfc5_1', 0.9781934487046426], ['ulwrf_s3_1', 0.9712226930987012], ['tmp_sfc2_1', 0.9684774894714101], ['tmp_2m_2_1', 0.9679046544965666], ['tmax_2m2_1', 0.9645870257681579]]
[['Columna: tmp_2m_5_1'], ['tmin_2m5_1', 0.9985248666465218], ['tmp_sfc5_1', 0.9980075740584928], ['tmax_2m5_1', 0.9872536923433799], ['tmp_2m_4_1', 0.9871404965220815], ['tmax_2m4_1', 0.9862698111992504], ['ulwrf_s5_1', 0.9852702329370912], ['tmin_2m4_1', 0.9836800034039297], ['tmp_2m_3_1', 0.9813875596975576], ['tmax_2m3_1', 0.9800734262132967], ['tmp_sfc4_1', 0.9781233238955636], ['ulwrf_s3_1', 0.9731828651939985], ['tmp_sfc2_1', 0.971798288488127], ['tmp_2m_2_1', 0.9702697100229983], ['ulwrf_s4_1', 0.9695091232286572], ['tmax_2m2_1', 0.9675902215544175], ['tmp_sfc3_1', 0.9646544198978636], ['ulwrf_s2_1', 0.9560882131646383]]
[['Columna: tmp_sfc1_1'], ['tmp_2m_1_1', 0.9979448017549044], ['tmin_2m1_1', 0.9963142972792025], ['tmin_2m2_1', 0.9960749292152973], ['tmin_2m3_1', 0.9951903927420194], ['ulwrf_s1_1', 0.9919281097281679], ['ulwrf_s2_1', 0.9842676480547997], ['tmax_2m1_1', 0.9826171735169642], ['tmax_2m2_1', 0.9697293778574791], ['tmp_2m_2_1', 0.9668200402557023], ['ulwrf_s3_1', 0.9621315029521358], ['tmp_sfc2_1', 0.957188784436685]]
[['Columna: tmp_sfc2_1'], ['tmp_2m_2_1', 0.9971658992806404], ['ulwrf_s3_1', 0.996855293888584], ['tmax_2m2_1', 0.9963595768447152], ['ulwrf_s2_1', 0.9879322271553737], ['tmax_2m3_1', 0.9874628996347063], ['tmp_2m_3_1', 0.9853721430018374], ['tmin_2m4_1', 0.9818609853360799], ['tmin_2m5_1', 0.975307217691161], ['tmp_sfc5_1', 0.9740593607013978], ['tmax_2m4_1', 0.9734162968894661], ['tmax_2m5_1', 0.9732760733014082], ['tmp_2m_5_1', 0.971798288488127], ['tmin_2m3_1', 0.9715874768684567], ['tmin_2m2_1', 0.9709866438853475], ['tmp_2m_4_1', 0.9684774894714101], ['ulwrf_s5_1', 0.9670339151339817], ['tmp_sfc3_1', 0.9649823936827109], ['tmp_2m_1_1', 0.9638593940360212], ['tmin_2m1_1', 0.9605896048258724], ['ulwrf_s4_1', 0.9572236269499521], ['tmp_sfc1_1', 0.957188784436685], ['tmp_sfc4_1', 0.9555363809053586], ['tmax_2m1_1', 0.9528200105400535]]
[['Columna: tmp_sfc3_1'], ['ulwrf_s4_1', 0.9947627885687921], ['ulwrf_s5_1', 0.9892581590922543], ['tmp_sfc4_1', 0.9884745475700606], ['tmin_2m4_1', 0.985897294709102], ['tmax_2m4_1', 0.9857825429301692], ['tmp_2m_3_1', 0.9856020769129077], ['tmax_2m5_1', 0.9848432369308402], ['tmp_2m_4_1', 0.9838690005793208], ['tmax_2m3_1', 0.9836268315982332], ['tmin_2m5_1', 0.9704219757785946], ['ulwrf_s3_1', 0.9698096931439735], ['tmp_sfc2_1', 0.9649823936827109], ['tmp_2m_5_1', 0.9646544198978636], ['tmp_2m_2_1', 0.9570765900532514], ['tmp_sfc5_1', 0.9562755066656142], ['tmax_2m2_1', 0.9540154794130064]]
[['Columna: tmp_sfc4_1'], ['ulwrf_s5_1', 0.996612200039398], ['ulwrf_s4_1', 0.9957411309514121], ['tmp_2m_4_1', 0.9924835965861123], ['tmax_2m4_1', 0.9903005164322048], ['tmax_2m5_1', 0.9902303273948138], ['tmp_sfc3_1', 0.9884745475700606], ['tmin_2m4_1', 0.9839686317112281], ['tmin_2m5_1', 0.9799245981499726], ['tmp_2m_3_1', 0.9796642142509698], ['tmp_2m_5_1', 0.9781233238955636], ['tmax_2m3_1', 0.977121003001589], ['tmp_sfc5_1', 0.9687614236330414], ['ulwrf_s3_1', 0.9604279752910915], ['tmp_sfc2_1', 0.9555363809053586]]
[['Columna: tmp_sfc5_1'], ['tmp_2m_5_1', 0.9980075740584928], ['tmin_2m5_1', 0.9960054258735697], ['tmax_2m5_1', 0.9791492889290042], ['ulwrf_s5_1', 0.9789322162596626], ['tmp_2m_4_1', 0.9781934487046426], ['tmax_2m4_1', 0.9780961580272722], ['tmin_2m4_1', 0.9773521584222015], ['tmp_2m_3_1', 0.9758811197330793], ['tmax_2m3_1', 0.9751782371847001], ['ulwrf_s3_1', 0.9744478191517459], ['tmp_sfc2_1', 0.9740593607013978], ['tmp_2m_2_1', 0.9714528776518785], ['tmax_2m2_1', 0.9692885564165558], ['tmp_sfc4_1', 0.9687614236330414], ['ulwrf_s2_1', 0.9619766387479004], ['ulwrf_s4_1', 0.960457161005563], ['tmp_sfc3_1', 0.9562755066656142]]
[['Columna: ulwrf_s1_1'], ['tmin_2m1_1', 0.9926165091998508], ['tmax_2m1_1', 0.9925465923917536], ['tmp_sfc1_1', 0.9919281097281679], ['tmp_2m_1_1', 0.9897625110599417], ['tmin_2m2_1', 0.9850580289147287], ['tmin_2m3_1', 0.9843176374301973], ['ulwrf_s2_1', 0.9762674995860944], ['tmax_2m2_1', 0.9606784450210313], ['ulwrf_s3_1', 0.9574576992187499], ['tmp_2m_2_1', 0.9573813147342563]]
[['Columna: ulwrf_s2_1'], ['tmax_2m2_1', 0.9917542607726626], ['tmp_2m_2_1', 0.990609269111517], ['ulwrf_s3_1', 0.990564307766117], ['tmin_2m2_1', 0.9899721168841563], ['tmin_2m3_1', 0.9897318251363909], ['tmp_sfc2_1', 0.9879322271553737], ['tmp_2m_1_1', 0.9853437253651708], ['tmp_sfc1_1', 0.9842676480547997], ['tmin_2m1_1', 0.9814254257054185], ['ulwrf_s1_1', 0.9762674995860944], ['tmax_2m1_1', 0.9707221719806952], ['tmax_2m3_1', 0.966905126635995], ['tmp_2m_3_1', 0.9631310421561378], ['tmp_sfc5_1', 0.9619766387479004], ['tmin_2m4_1', 0.9587402415390521], ['tmin_2m5_1', 0.9584808725553905], ['tmp_2m_5_1', 0.9560882131646383]]
[['Columna: ulwrf_s3_1'], ['tmp_sfc2_1', 0.996855293888584], ['tmp_2m_2_1', 0.994870585784455], ['tmax_2m2_1', 0.9943559569788851], ['ulwrf_s2_1', 0.990564307766117], ['tmax_2m3_1', 0.9882838798756105], ['tmp_2m_3_1', 0.986475159576972], ['tmin_2m4_1', 0.9834305304179148], ['tmin_2m5_1', 0.9771636437810954], ['tmax_2m4_1', 0.975826585247475], ['tmax_2m5_1', 0.975536953838263], ['tmin_2m3_1', 0.9751256747221002], ['tmp_sfc5_1', 0.9744478191517459], ['tmin_2m2_1', 0.9743529570168544], ['ulwrf_s5_1', 0.9734148064192735], ['tmp_2m_5_1', 0.9731828651939985], ['tmp_2m_4_1', 0.9712226930987012], ['tmp_sfc3_1', 0.9698096931439735], ['tmp_2m_1_1', 0.968308266178652], ['tmin_2m1_1', 0.9654648681213764], ['ulwrf_s4_1', 0.9651706956885256], ['tmp_sfc1_1', 0.9621315029521358], ['tmp_sfc4_1', 0.9604279752910915], ['tmax_2m1_1', 0.9578768560091268], ['ulwrf_s1_1', 0.9574576992187499]]
[['Columna: ulwrf_s4_1'], ['ulwrf_s5_1', 0.9963430558611763], ['tmp_sfc4_1', 0.9957411309514121], ['tmp_sfc3_1', 0.9947627885687921], ['tmp_2m_4_1', 0.9875947598774877], ['tmax_2m4_1', 0.987077087459209], ['tmax_2m5_1', 0.9863521279974694], ['tmin_2m4_1', 0.9837254094808376], ['tmp_2m_3_1', 0.9807390701133252], ['tmax_2m3_1', 0.9784038012441758], ['tmin_2m5_1', 0.9737488476761204], ['tmp_2m_5_1', 0.9695091232286572], ['ulwrf_s3_1', 0.9651706956885256], ['tmp_sfc5_1', 0.960457161005563], ['tmp_sfc2_1', 0.9572236269499521]]
[['Columna: ulwrf_s5_1'], ['tmp_sfc4_1', 0.996612200039398], ['ulwrf_s4_1', 0.9963430558611763], ['tmp_2m_4_1', 0.9926662202081515], ['tmax_2m4_1', 0.9915793553887863], ['tmax_2m5_1', 0.99151293872598], ['tmp_sfc3_1', 0.9892581590922543], ['tmin_2m4_1', 0.9881522932211056], ['tmin_2m5_1', 0.9873222874147335], ['tmp_2m_5_1', 0.9852702329370912], ['tmp_2m_3_1', 0.9850287211532148], ['tmax_2m3_1', 0.9831073628529786], ['tmp_sfc5_1', 0.9789322162596626], ['ulwrf_s3_1', 0.9734148064192735], ['tmp_sfc2_1', 0.9670339151339817], ['tmp_2m_2_1', 0.960980140611558], ['tmax_2m2_1', 0.9582268665523904]]
[['Columna: ulwrf_t1_1']]
[['Columna: ulwrf_t2_1'], ['ulwrf_t3_1', 0.9744666921198298]]
[['Columna: ulwrf_t3_1'], ['ulwrf_t2_1', 0.9744666921198298]]
[['Columna: ulwrf_t4_1'], ['ulwrf_t5_1', 0.9755542908956468]]
[['Columna: ulwrf_t5_1'], ['ulwrf_t4_1', 0.9755542908956468]]
[['Columna: uswrf_s1_1']]
[['Columna: uswrf_s2_1'], ['dswrf_s2_1', 0.9911709851006711], ['dswrf_s3_1', 0.9591814530708258]]
[['Columna: uswrf_s3_1']]
[['Columna: uswrf_s4_1'], ['uswrf_s5_1', 0.9562280634672189]]
[['Columna: uswrf_s5_1'], ['uswrf_s4_1', 0.9562280634672189]]
[['Columna: salida']]
[[['Columna: apcp_sf1_1']], [['Columna: apcp_sf2_1']], [['Columna: apcp_sf3_1']], [['Columna: apcp_sf4_1']], [['Columna: apcp_sf5_1']], [['Columna: dlwrf_s1_1'], ['dlwrf_s2_1', 0.9650067922254768], ['dlwrf_s3_1', 0.9547817730760655]], [['Columna: dlwrf_s2_1'], ['dlwrf_s3_1', 0.993701215706055], ['dlwrf_s1_1', 0.9650067922254768]], [['Columna: dlwrf_s3_1'], ['dlwrf_s2_1', 0.993701215706055], ['dlwrf_s4_1', 0.9659874690575408], ['dlwrf_s5_1', 0.9552712673845433], ['dlwrf_s1_1', 0.9547817730760655]], [['Columna: dlwrf_s4_1'], ['dlwrf_s5_1', 0.9969222914149775], ['dlwrf_s3_1', 0.9659874690575408]], [['Columna: dlwrf_s5_1'], ['dlwrf_s4_1', 0.9969222914149775], ['dlwrf_s3_1', 0.9552712673845433]], [['Columna: dswrf_s1_1']], [['Columna: dswrf_s2_1'], ['uswrf_s2_1', 0.9911709851006711], ['dswrf_s3_1', 0.9503896354343679]], [['Columna: dswrf_s3_1'], ['uswrf_s2_1', 0.9591814530708258], ['dswrf_s2_1', 0.9503896354343679]], [['Columna: dswrf_s4_1'], ['dswrf_s5_1', 0.982758557897581]], [['Columna: dswrf_s5_1'], ['dswrf_s4_1', 0.982758557897581]], [['Columna: pres_ms1_1'], ['pres_ms2_1', 0.9879236602955379], ['pres_ms3_1', 0.956852960202746]], [['Columna: pres_ms2_1'], ['pres_ms1_1', 0.9879236602955379], ['pres_ms3_1', 0.9869377705171734], ['pres_ms4_1', 0.9536176398645005]], [['Columna: pres_ms3_1'], ['pres_ms2_1', 0.9869377705171734], ['pres_ms4_1', 0.9866602703072012], ['pres_ms1_1', 0.956852960202746], ['pres_ms5_1', 0.9538147697170144]], [['Columna: pres_ms4_1'], ['pres_ms3_1', 0.9866602703072012], ['pres_ms5_1', 0.9851755074525863], ['pres_ms2_1', 0.9536176398645005]], [['Columna: pres_ms5_1'], ['pres_ms4_1', 0.9851755074525863], ['pres_ms3_1', 0.9538147697170144]], [['Columna: pwat_ea1_1'], ['pwat_ea2_1', 0.9859484994851248], ['pwat_ea3_1', 0.9577107162594556]], [['Columna: pwat_ea2_1'], ['pwat_ea3_1', 0.9874259658433963], ['pwat_ea1_1', 0.9859484994851248], ['pwat_ea4_1', 0.9618712300670131]], [['Columna: pwat_ea3_1'], ['pwat_ea4_1', 0.9880603787665849], ['pwat_ea2_1', 0.9874259658433963], ['pwat_ea5_1', 0.9616424908340101], ['pwat_ea1_1', 0.9577107162594556]], [['Columna: pwat_ea4_1'], ['pwat_ea3_1', 0.9880603787665849], ['pwat_ea5_1', 0.986763801908917], ['pwat_ea2_1', 0.9618712300670131]], [['Columna: pwat_ea5_1'], ['pwat_ea4_1', 0.986763801908917], ['pwat_ea3_1', 0.9616424908340101]], [['Columna: spfh_2m1_1'], ['spfh_2m2_1', 0.9742691195680059]], [['Columna: spfh_2m2_1'], ['spfh_2m3_1', 0.9846069576918387], ['spfh_2m1_1', 0.9742691195680059], ['spfh_2m4_1', 0.9600698332225309]], [['Columna: spfh_2m3_1'], ['spfh_2m4_1', 0.9891201306737782], ['spfh_2m2_1', 0.9846069576918387], ['spfh_2m5_1', 0.9771699520274281]], [['Columna: spfh_2m4_1'], ['spfh_2m5_1', 0.9904262248914517], ['spfh_2m3_1', 0.9891201306737782], ['spfh_2m2_1', 0.9600698332225309]], [['Columna: spfh_2m5_1'], ['spfh_2m4_1', 0.9904262248914517], ['spfh_2m3_1', 0.9771699520274281]], [['Columna: tcdc_ea1_1'], ['tcolc_e1_1', 0.9999826963362115]], [['Columna: tcdc_ea2_1'], ['tcolc_e2_1', 0.9999837132775715]], [['Columna: tcdc_ea3_1'], ['tcolc_e3_1', 0.9999845616560729]], [['Columna: tcdc_ea4_1'], ['tcolc_e4_1', 0.999984785893167]], [['Columna: tcdc_ea5_1'], ['tcolc_e5_1', 0.9999746391911669]], [['Columna: tcolc_e1_1'], ['tcdc_ea1_1', 0.9999826963362115]], [['Columna: tcolc_e2_1'], ['tcdc_ea2_1', 0.9999837132775715]], [['Columna: tcolc_e3_1'], ['tcdc_ea3_1', 0.9999845616560729]], [['Columna: tcolc_e4_1'], ['tcdc_ea4_1', 0.999984785893167]], [['Columna: tcolc_e5_1'], ['tcdc_ea5_1', 0.9999746391911669]], [['Columna: tmax_2m1_1'], ['ulwrf_s1_1', 0.9925465923917536], ['tmin_2m1_1', 0.9864627566914622], ['tmp_2m_1_1', 0.9844648786308661], ['tmp_sfc1_1', 0.9826171735169642], ['tmin_2m2_1', 0.9794577781348043], ['tmin_2m3_1', 0.97879484323375], ['ulwrf_s2_1', 0.9707221719806952], ['tmax_2m2_1', 0.9637997824764319], ['tmp_2m_2_1', 0.9602996922677912], ['ulwrf_s3_1', 0.9578768560091268], ['tmp_sfc2_1', 0.9528200105400535]], [['Columna: tmax_2m2_1'], ['tmp_2m_2_1', 0.9993112900738094], ['tmp_sfc2_1', 0.9963595768447152], ['ulwrf_s3_1', 0.9943559569788851], ['ulwrf_s2_1', 0.9917542607726626], ['tmax_2m3_1', 0.9863610206089146], ['tmp_2m_3_1', 0.9829472636912582], ['tmin_2m2_1', 0.9819461458841241], ['tmin_2m3_1', 0.9819334574863472], ['tmin_2m4_1', 0.9790285687297058], ['tmp_2m_1_1', 0.9771397511742945], ['tmin_2m1_1', 0.9735093237780129], ['tmin_2m5_1', 0.9713195080332866], ['tmax_2m4_1', 0.9698857690834339], ['tmax_2m5_1', 0.9697541323437304], ['tmp_sfc1_1', 0.9697293778574791], ['tmp_sfc5_1', 0.9692885564165558], ['tmp_2m_5_1', 0.9675902215544175], ['tmp_2m_4_1', 0.9645870257681579], ['tmax_2m1_1', 0.9637997824764319], ['ulwrf_s1_1', 0.9606784450210313], ['ulwrf_s5_1', 0.9582268665523904], ['tmp_sfc3_1', 0.9540154794130064]], [['Columna: tmax_2m3_1'], ['tmp_2m_3_1', 0.9989979986044517], ['tmin_2m4_1', 0.9973034556239585], ['tmax_2m4_1', 0.9937711137699543], ['tmax_2m5_1', 0.9931925682521422], ['tmp_2m_4_1', 0.9902875444682745], ['ulwrf_s3_1', 0.9882838798756105], ['tmp_2m_2_1', 0.9880399268817291], ['tmp_sfc2_1', 0.9874628996347063], ['tmax_2m2_1', 0.9863610206089146], ['tmin_2m5_1', 0.9849440628695366], ['tmp_sfc3_1', 0.9836268315982332], ['ulwrf_s5_1', 0.9831073628529786], ['tmp_2m_5_1', 0.9800734262132967], ['ulwrf_s4_1', 0.9784038012441758], ['tmp_sfc4_1', 0.977121003001589], ['tmp_sfc5_1', 0.9751782371847001], ['ulwrf_s2_1', 0.966905126635995], ['tmin_2m3_1', 0.9565532246066281], ['tmin_2m2_1', 0.9554925734198499]], [['Columna: tmax_2m4_1'], ['tmax_2m5_1', 0.999855391919824], ['tmp_2m_4_1', 0.9989670084606509], ['tmin_2m4_1', 0.9965267117336937], ['tmp_2m_3_1', 0.9954824569443925], ['tmax_2m3_1', 0.9937711137699543], ['ulwrf_s5_1', 0.9915793553887863], ['tmp_sfc4_1', 0.9903005164322048], ['tmin_2m5_1', 0.9888950622930037], ['ulwrf_s4_1', 0.987077087459209], ['tmp_2m_5_1', 0.9862698111992504], ['tmp_sfc3_1', 0.9857825429301692], ['tmp_sfc5_1', 0.9780961580272722], ['ulwrf_s3_1', 0.975826585247475], ['tmp_sfc2_1', 0.9734162968894661], ['tmp_2m_2_1', 0.9730783300205592], ['tmax_2m2_1', 0.9698857690834339]], [['Columna: tmax_2m5_1'], ['tmax_2m4_1', 0.999855391919824], ['tmp_2m_4_1', 0.9988753498975144], ['tmin_2m4_1', 0.9959215026917639], ['tmp_2m_3_1', 0.9948623637802105], ['tmax_2m3_1', 0.9931925682521422], ['ulwrf_s5_1', 0.99151293872598], ['tmp_sfc4_1', 0.9902303273948138], ['tmin_2m5_1', 0.9891405894645761], ['tmp_2m_5_1', 0.9872536923433799], ['ulwrf_s4_1', 0.9863521279974694], ['tmp_sfc3_1', 0.9848432369308402], ['tmp_sfc5_1', 0.9791492889290042], ['ulwrf_s3_1', 0.975536953838263], ['tmp_sfc2_1', 0.9732760733014082], ['tmp_2m_2_1', 0.9729567511523166], ['tmax_2m2_1', 0.9697541323437304]], [['Columna: tmin_2m1_1'], ['tmp_2m_1_1', 0.9981185076252812], ['tmp_sfc1_1', 0.9963142972792025], ['tmin_2m2_1', 0.9950949108277914], ['tmin_2m3_1', 0.9943408586923532], ['ulwrf_s1_1', 0.9926165091998508], ['tmax_2m1_1', 0.9864627566914622], ['ulwrf_s2_1', 0.9814254257054185], ['tmax_2m2_1', 0.9735093237780129], ['tmp_2m_2_1', 0.9708467193246784], ['ulwrf_s3_1', 0.9654648681213764], ['tmp_sfc2_1', 0.9605896048258724]], [['Columna: tmin_2m2_1'], ['tmin_2m3_1', 0.9996572623026859], ['tmp_2m_1_1', 0.9981819106379802], ['tmp_sfc1_1', 0.9960749292152973], ['tmin_2m1_1', 0.9950949108277914], ['ulwrf_s2_1', 0.9899721168841563], ['ulwrf_s1_1', 0.9850580289147287], ['tmax_2m2_1', 0.9819461458841241], ['tmp_2m_2_1', 0.9808603480091717], ['tmax_2m1_1', 0.9794577781348043], ['ulwrf_s3_1', 0.9743529570168544], ['tmp_sfc2_1', 0.9709866438853475], ['tmax_2m3_1', 0.9554925734198499], ['tmp_2m_3_1', 0.9516543862813964]], [['Columna: tmin_2m3_1'], ['tmin_2m2_1', 0.9996572623026859], ['tmp_2m_1_1', 0.9973859703817012], ['tmp_sfc1_1', 0.9951903927420194], ['tmin_2m1_1', 0.9943408586923532], ['ulwrf_s2_1', 0.9897318251363909], ['ulwrf_s1_1', 0.9843176374301973], ['tmax_2m2_1', 0.9819334574863472], ['tmp_2m_2_1', 0.9813049765366397], ['tmax_2m1_1', 0.97879484323375], ['ulwrf_s3_1', 0.9751256747221002], ['tmp_sfc2_1', 0.9715874768684567], ['tmax_2m3_1', 0.9565532246066281], ['tmp_2m_3_1', 0.9537926818714694]], [['Columna: tmin_2m4_1'], ['tmp_2m_3_1', 0.9989480053798309], ['tmax_2m3_1', 0.9973034556239585], ['tmax_2m4_1', 0.9965267117336937], ['tmax_2m5_1', 0.9959215026917639], ['tmp_2m_4_1', 0.9955546180266533], ['tmin_2m5_1', 0.9884708317878447], ['ulwrf_s5_1', 0.9881522932211056], ['tmp_sfc3_1', 0.985897294709102], ['tmp_sfc4_1', 0.9839686317112281], ['ulwrf_s4_1', 0.9837254094808376], ['tmp_2m_5_1', 0.9836800034039297], ['ulwrf_s3_1', 0.9834305304179148], ['tmp_2m_2_1', 0.9819192741530762], ['tmp_sfc2_1', 0.9818609853360799], ['tmax_2m2_1', 0.9790285687297058], ['tmp_sfc5_1', 0.9773521584222015], ['ulwrf_s2_1', 0.9587402415390521]], [['Columna: tmin_2m5_1'], ['tmp_2m_5_1', 0.9985248666465218], ['tmp_sfc5_1', 0.9960054258735697], ['tmp_2m_4_1', 0.9895327574093381], ['tmax_2m5_1', 0.9891405894645761], ['tmax_2m4_1', 0.9888950622930037], ['tmin_2m4_1', 0.9884708317878447], ['ulwrf_s5_1', 0.9873222874147335], ['tmp_2m_3_1', 0.9863630955598195], ['tmax_2m3_1', 0.9849440628695366], ['tmp_sfc4_1', 0.9799245981499726], ['ulwrf_s3_1', 0.9771636437810954], ['tmp_sfc2_1', 0.975307217691161], ['tmp_2m_2_1', 0.973997775172227], ['ulwrf_s4_1', 0.9737488476761204], ['tmax_2m2_1', 0.9713195080332866], ['tmp_sfc3_1', 0.9704219757785946], ['ulwrf_s2_1', 0.9584808725553905]], [['Columna: tmp_2m_1_1'], ['tmin_2m2_1', 0.9981819106379802], ['tmin_2m1_1', 0.9981185076252812], ['tmp_sfc1_1', 0.9979448017549044], ['tmin_2m3_1', 0.9973859703817012], ['ulwrf_s1_1', 0.9897625110599417], ['ulwrf_s2_1', 0.9853437253651708], ['tmax_2m1_1', 0.9844648786308661], ['tmax_2m2_1', 0.9771397511742945], ['tmp_2m_2_1', 0.9744909003873826], ['ulwrf_s3_1', 0.968308266178652], ['tmp_sfc2_1', 0.9638593940360212]], [['Columna: tmp_2m_2_1'], ['tmax_2m2_1', 0.9993112900738094], ['tmp_sfc2_1', 0.9971658992806404], ['ulwrf_s3_1', 0.994870585784455], ['ulwrf_s2_1', 0.990609269111517], ['tmax_2m3_1', 0.9880399268817291], ['tmp_2m_3_1', 0.9856921834337115], ['tmin_2m4_1', 0.9819192741530762], ['tmin_2m3_1', 0.9813049765366397], ['tmin_2m2_1', 0.9808603480091717], ['tmp_2m_1_1', 0.9744909003873826], ['tmin_2m5_1', 0.973997775172227], ['tmax_2m4_1', 0.9730783300205592], ['tmax_2m5_1', 0.9729567511523166], ['tmp_sfc5_1', 0.9714528776518785], ['tmin_2m1_1', 0.9708467193246784], ['tmp_2m_5_1', 0.9702697100229983], ['tmp_2m_4_1', 0.9679046544965666], ['tmp_sfc1_1', 0.9668200402557023], ['ulwrf_s5_1', 0.960980140611558], ['tmax_2m1_1', 0.9602996922677912], ['ulwrf_s1_1', 0.9573813147342563], ['tmp_sfc3_1', 0.9570765900532514]], [['Columna: tmp_2m_3_1'], ['tmax_2m3_1', 0.9989979986044517], ['tmin_2m4_1', 0.9989480053798309], ['tmax_2m4_1', 0.9954824569443925], ['tmax_2m5_1', 0.9948623637802105], ['tmp_2m_4_1', 0.9925508354894933], ['ulwrf_s3_1', 0.986475159576972], ['tmin_2m5_1', 0.9863630955598195], ['tmp_2m_2_1', 0.9856921834337115], ['tmp_sfc3_1', 0.9856020769129077], ['tmp_sfc2_1', 0.9853721430018374], ['ulwrf_s5_1', 0.9850287211532148], ['tmax_2m2_1', 0.9829472636912582], ['tmp_2m_5_1', 0.9813875596975576], ['ulwrf_s4_1', 0.9807390701133252], ['tmp_sfc4_1', 0.9796642142509698], ['tmp_sfc5_1', 0.9758811197330793], ['ulwrf_s2_1', 0.9631310421561378], ['tmin_2m3_1', 0.9537926818714694], ['tmin_2m2_1', 0.9516543862813964]], [['Columna: tmp_2m_4_1'], ['tmax_2m4_1', 0.9989670084606509], ['tmax_2m5_1', 0.9988753498975144], ['tmin_2m4_1', 0.9955546180266533], ['ulwrf_s5_1', 0.9926662202081515], ['tmp_2m_3_1', 0.9925508354894933], ['tmp_sfc4_1', 0.9924835965861123], ['tmax_2m3_1', 0.9902875444682745], ['tmin_2m5_1', 0.9895327574093381], ['ulwrf_s4_1', 0.9875947598774877], ['tmp_2m_5_1', 0.9871404965220815], ['tmp_sfc3_1', 0.9838690005793208], ['tmp_sfc5_1', 0.9781934487046426], ['ulwrf_s3_1', 0.9712226930987012], ['tmp_sfc2_1', 0.9684774894714101], ['tmp_2m_2_1', 0.9679046544965666], ['tmax_2m2_1', 0.9645870257681579]], [['Columna: tmp_2m_5_1'], ['tmin_2m5_1', 0.9985248666465218], ['tmp_sfc5_1', 0.9980075740584928], ['tmax_2m5_1', 0.9872536923433799], ['tmp_2m_4_1', 0.9871404965220815], ['tmax_2m4_1', 0.9862698111992504], ['ulwrf_s5_1', 0.9852702329370912], ['tmin_2m4_1', 0.9836800034039297], ['tmp_2m_3_1', 0.9813875596975576], ['tmax_2m3_1', 0.9800734262132967], ['tmp_sfc4_1', 0.9781233238955636], ['ulwrf_s3_1', 0.9731828651939985], ['tmp_sfc2_1', 0.971798288488127], ['tmp_2m_2_1', 0.9702697100229983], ['ulwrf_s4_1', 0.9695091232286572], ['tmax_2m2_1', 0.9675902215544175], ['tmp_sfc3_1', 0.9646544198978636], ['ulwrf_s2_1', 0.9560882131646383]], [['Columna: tmp_sfc1_1'], ['tmp_2m_1_1', 0.9979448017549044], ['tmin_2m1_1', 0.9963142972792025], ['tmin_2m2_1', 0.9960749292152973], ['tmin_2m3_1', 0.9951903927420194], ['ulwrf_s1_1', 0.9919281097281679], ['ulwrf_s2_1', 0.9842676480547997], ['tmax_2m1_1', 0.9826171735169642], ['tmax_2m2_1', 0.9697293778574791], ['tmp_2m_2_1', 0.9668200402557023], ['ulwrf_s3_1', 0.9621315029521358], ['tmp_sfc2_1', 0.957188784436685]], [['Columna: tmp_sfc2_1'], ['tmp_2m_2_1', 0.9971658992806404], ['ulwrf_s3_1', 0.996855293888584], ['tmax_2m2_1', 0.9963595768447152], ['ulwrf_s2_1', 0.9879322271553737], ['tmax_2m3_1', 0.9874628996347063], ['tmp_2m_3_1', 0.9853721430018374], ['tmin_2m4_1', 0.9818609853360799], ['tmin_2m5_1', 0.975307217691161], ['tmp_sfc5_1', 0.9740593607013978], ['tmax_2m4_1', 0.9734162968894661], ['tmax_2m5_1', 0.9732760733014082], ['tmp_2m_5_1', 0.971798288488127], ['tmin_2m3_1', 0.9715874768684567], ['tmin_2m2_1', 0.9709866438853475], ['tmp_2m_4_1', 0.9684774894714101], ['ulwrf_s5_1', 0.9670339151339817], ['tmp_sfc3_1', 0.9649823936827109], ['tmp_2m_1_1', 0.9638593940360212], ['tmin_2m1_1', 0.9605896048258724], ['ulwrf_s4_1', 0.9572236269499521], ['tmp_sfc1_1', 0.957188784436685], ['tmp_sfc4_1', 0.9555363809053586], ['tmax_2m1_1', 0.9528200105400535]], [['Columna: tmp_sfc3_1'], ['ulwrf_s4_1', 0.9947627885687921], ['ulwrf_s5_1', 0.9892581590922543], ['tmp_sfc4_1', 0.9884745475700606], ['tmin_2m4_1', 0.985897294709102], ['tmax_2m4_1', 0.9857825429301692], ['tmp_2m_3_1', 0.9856020769129077], ['tmax_2m5_1', 0.9848432369308402], ['tmp_2m_4_1', 0.9838690005793208], ['tmax_2m3_1', 0.9836268315982332], ['tmin_2m5_1', 0.9704219757785946], ['ulwrf_s3_1', 0.9698096931439735], ['tmp_sfc2_1', 0.9649823936827109], ['tmp_2m_5_1', 0.9646544198978636], ['tmp_2m_2_1', 0.9570765900532514], ['tmp_sfc5_1', 0.9562755066656142], ['tmax_2m2_1', 0.9540154794130064]], [['Columna: tmp_sfc4_1'], ['ulwrf_s5_1', 0.996612200039398], ['ulwrf_s4_1', 0.9957411309514121], ['tmp_2m_4_1', 0.9924835965861123], ['tmax_2m4_1', 0.9903005164322048], ['tmax_2m5_1', 0.9902303273948138], ['tmp_sfc3_1', 0.9884745475700606], ['tmin_2m4_1', 0.9839686317112281], ['tmin_2m5_1', 0.9799245981499726], ['tmp_2m_3_1', 0.9796642142509698], ['tmp_2m_5_1', 0.9781233238955636], ['tmax_2m3_1', 0.977121003001589], ['tmp_sfc5_1', 0.9687614236330414], ['ulwrf_s3_1', 0.9604279752910915], ['tmp_sfc2_1', 0.9555363809053586]], [['Columna: tmp_sfc5_1'], ['tmp_2m_5_1', 0.9980075740584928], ['tmin_2m5_1', 0.9960054258735697], ['tmax_2m5_1', 0.9791492889290042], ['ulwrf_s5_1', 0.9789322162596626], ['tmp_2m_4_1', 0.9781934487046426], ['tmax_2m4_1', 0.9780961580272722], ['tmin_2m4_1', 0.9773521584222015], ['tmp_2m_3_1', 0.9758811197330793], ['tmax_2m3_1', 0.9751782371847001], ['ulwrf_s3_1', 0.9744478191517459], ['tmp_sfc2_1', 0.9740593607013978], ['tmp_2m_2_1', 0.9714528776518785], ['tmax_2m2_1', 0.9692885564165558], ['tmp_sfc4_1', 0.9687614236330414], ['ulwrf_s2_1', 0.9619766387479004], ['ulwrf_s4_1', 0.960457161005563], ['tmp_sfc3_1', 0.9562755066656142]], [['Columna: ulwrf_s1_1'], ['tmin_2m1_1', 0.9926165091998508], ['tmax_2m1_1', 0.9925465923917536], ['tmp_sfc1_1', 0.9919281097281679], ['tmp_2m_1_1', 0.9897625110599417], ['tmin_2m2_1', 0.9850580289147287], ['tmin_2m3_1', 0.9843176374301973], ['ulwrf_s2_1', 0.9762674995860944], ['tmax_2m2_1', 0.9606784450210313], ['ulwrf_s3_1', 0.9574576992187499], ['tmp_2m_2_1', 0.9573813147342563]], [['Columna: ulwrf_s2_1'], ['tmax_2m2_1', 0.9917542607726626], ['tmp_2m_2_1', 0.990609269111517], ['ulwrf_s3_1', 0.990564307766117], ['tmin_2m2_1', 0.9899721168841563], ['tmin_2m3_1', 0.9897318251363909], ['tmp_sfc2_1', 0.9879322271553737], ['tmp_2m_1_1', 0.9853437253651708], ['tmp_sfc1_1', 0.9842676480547997], ['tmin_2m1_1', 0.9814254257054185], ['ulwrf_s1_1', 0.9762674995860944], ['tmax_2m1_1', 0.9707221719806952], ['tmax_2m3_1', 0.966905126635995], ['tmp_2m_3_1', 0.9631310421561378], ['tmp_sfc5_1', 0.9619766387479004], ['tmin_2m4_1', 0.9587402415390521], ['tmin_2m5_1', 0.9584808725553905], ['tmp_2m_5_1', 0.9560882131646383]], [['Columna: ulwrf_s3_1'], ['tmp_sfc2_1', 0.996855293888584], ['tmp_2m_2_1', 0.994870585784455], ['tmax_2m2_1', 0.9943559569788851], ['ulwrf_s2_1', 0.990564307766117], ['tmax_2m3_1', 0.9882838798756105], ['tmp_2m_3_1', 0.986475159576972], ['tmin_2m4_1', 0.9834305304179148], ['tmin_2m5_1', 0.9771636437810954], ['tmax_2m4_1', 0.975826585247475], ['tmax_2m5_1', 0.975536953838263], ['tmin_2m3_1', 0.9751256747221002], ['tmp_sfc5_1', 0.9744478191517459], ['tmin_2m2_1', 0.9743529570168544], ['ulwrf_s5_1', 0.9734148064192735], ['tmp_2m_5_1', 0.9731828651939985], ['tmp_2m_4_1', 0.9712226930987012], ['tmp_sfc3_1', 0.9698096931439735], ['tmp_2m_1_1', 0.968308266178652], ['tmin_2m1_1', 0.9654648681213764], ['ulwrf_s4_1', 0.9651706956885256], ['tmp_sfc1_1', 0.9621315029521358], ['tmp_sfc4_1', 0.9604279752910915], ['tmax_2m1_1', 0.9578768560091268], ['ulwrf_s1_1', 0.9574576992187499]], [['Columna: ulwrf_s4_1'], ['ulwrf_s5_1', 0.9963430558611763], ['tmp_sfc4_1', 0.9957411309514121], ['tmp_sfc3_1', 0.9947627885687921], ['tmp_2m_4_1', 0.9875947598774877], ['tmax_2m4_1', 0.987077087459209], ['tmax_2m5_1', 0.9863521279974694], ['tmin_2m4_1', 0.9837254094808376], ['tmp_2m_3_1', 0.9807390701133252], ['tmax_2m3_1', 0.9784038012441758], ['tmin_2m5_1', 0.9737488476761204], ['tmp_2m_5_1', 0.9695091232286572], ['ulwrf_s3_1', 0.9651706956885256], ['tmp_sfc5_1', 0.960457161005563], ['tmp_sfc2_1', 0.9572236269499521]], [['Columna: ulwrf_s5_1'], ['tmp_sfc4_1', 0.996612200039398], ['ulwrf_s4_1', 0.9963430558611763], ['tmp_2m_4_1', 0.9926662202081515], ['tmax_2m4_1', 0.9915793553887863], ['tmax_2m5_1', 0.99151293872598], ['tmp_sfc3_1', 0.9892581590922543], ['tmin_2m4_1', 0.9881522932211056], ['tmin_2m5_1', 0.9873222874147335], ['tmp_2m_5_1', 0.9852702329370912], ['tmp_2m_3_1', 0.9850287211532148], ['tmax_2m3_1', 0.9831073628529786], ['tmp_sfc5_1', 0.9789322162596626], ['ulwrf_s3_1', 0.9734148064192735], ['tmp_sfc2_1', 0.9670339151339817], ['tmp_2m_2_1', 0.960980140611558], ['tmax_2m2_1', 0.9582268665523904]], [['Columna: ulwrf_t1_1']], [['Columna: ulwrf_t2_1'], ['ulwrf_t3_1', 0.9744666921198298]], [['Columna: ulwrf_t3_1'], ['ulwrf_t2_1', 0.9744666921198298]], [['Columna: ulwrf_t4_1'], ['ulwrf_t5_1', 0.9755542908956468]], [['Columna: ulwrf_t5_1'], ['ulwrf_t4_1', 0.9755542908956468]], [['Columna: uswrf_s1_1']], [['Columna: uswrf_s2_1'], ['dswrf_s2_1', 0.9911709851006711], ['dswrf_s3_1', 0.9591814530708258]], [['Columna: uswrf_s3_1']], [['Columna: uswrf_s4_1'], ['uswrf_s5_1', 0.9562280634672189]], [['Columna: uswrf_s5_1'], ['uswrf_s4_1', 0.9562280634672189]], [['Columna: salida']]]

Correlation Heat Map¶

In [98]:
""" seaborne Correlation Heat Map """
# It needs to show all the columns
fig, ax = plt.subplots(figsize=(19, 18))

plt.title("Correlation Heat Map", y=1)
# We use blue color scale because it is easier to see the annotations and the correlation values
sns.heatmap(
    correlation,
    square=True,
    cmap="Blues",
    annot=True,
    fmt=".2f",
    annot_kws={"size": 4},
    cbar_kws={"shrink": 0.5},
    vmin=0.0,
    vmax=1,
)
# We can modify vmax=0.95 in order to get same color scale for values with more than 0.95 correlation
# Note: it delays around 15 seconds as it needs to plot a 76*76 matrix with its 5766 correlation values

# Exporting image as png to ../data/img folder - easier to visualize the annotations, better resolution
plt.savefig("../data/img/correlation_heatmap.png", dpi=200)

We can observe clearly how there are a lot of correlations between the different attributes, which is expected as they are all weather related variables.
This is important to know as it will allow us to know which attributes are redundant and which ones are not, so that we can delete the redundant ones in order to improve the model.

Once obtained the most correlated columns of the dataset, we can plot them and visualize their correlation.

In [99]:
# 1
columns = ['apcp_sf1_1', 'apcp_sf2_1', 'apcp_sf3_1','apcp_sf4_1', 'apcp_sf5_1']

sns.pairplot(train_df[columns], height = 1, kind ='scatter',diag_kind='kde')
plt.show()
In [100]:
# 2
columns = [ 'dlwrf_s1_1', 'dlwrf_s2_1', 'dlwrf_s3_1', 'dlwrf_s4_1', 'dlwrf_s5_1']

sns.pairplot(train_df[columns], height = 1, kind ='scatter',diag_kind='kde')
plt.show()
In [101]:
# 3
columns = ['pwat_ea1_1', 'pwat_ea2_1','pwat_ea3_1','pwat_ea4_1','pwat_ea5_1', 'dlwrf_s1_1', 'dlwrf_s2_1', 'dlwrf_s3_1', 'dlwrf_s4_1', 'dlwrf_s5_1']

sns.pairplot(train_df[columns], height = 1, kind ='scatter',diag_kind='kde')
plt.show()
In [102]:
# 4
columns = ['dswrf_s1_1', 'dswrf_s2_1', 'dswrf_s3_1', 'dswrf_s4_1', 'dswrf_s5_1']

sns.pairplot(train_df[columns], height = 1, kind ='scatter',diag_kind='kde')
plt.show()
In [103]:
# 5
columns = ['dswrf_s1_1', 'dswrf_s2_1', 'dswrf_s3_1', 'dswrf_s4_1', 'dswrf_s5_1', 'uswrf_s1_1', 'uswrf_s2_1', 'uswrf_s3_1', 'uswrf_s4_1', 'uswrf_s5_1']

sns.pairplot(train_df[columns], height = 1, kind ='scatter',diag_kind='kde')
plt.show()
In [104]:
# 6
columns = ['pres_ms1_1', 'pres_ms2_1', 'pres_ms3_1', 'pres_ms4_1', 'pres_ms5_1']

sns.pairplot(train_df[columns], height = 1, kind ='scatter',diag_kind='kde')
plt.show()
In [105]:
# 7
columns = ['pwat_ea1_1', 'pwat_ea2_1','pwat_ea3_1','pwat_ea4_1','pwat_ea5_1', 'spfh_2m1_1', 'spfh_2m2_1', 'spfh_2m3_1', 'spfh_2m4_1', 'spfh_2m5_1']

sns.pairplot(train_df[columns], height = 1, kind ='scatter',diag_kind='kde')
plt.show()
In [106]:
# 8
columns = ['spfh_2m1_1', 'spfh_2m2_1', 'spfh_2m3_1', 'spfh_2m4_1', 'spfh_2m5_1','ulwrf_s1_1', 'ulwrf_s2_1', 'ulwrf_s3_1', 'ulwrf_s4_1', 'ulwrf_s5_1']

sns.pairplot(train_df[columns], height = 1, kind ='scatter',diag_kind='kde')
plt.show()
In [107]:
# 9
columns = ['tmax_2m1_1', 'tmax_2m2_1', 'tmax_2m3_1', 'tmax_2m4_1', 'tmax_2m5_1', 'tmin_2m1_1', 'tmin_2m2_1', 'tmin_2m3_1', 'tmin_2m4_1', 'tmin_2m5_1','tmp_2m_1_1', 'tmp_2m_2_1', 'tmp_2m_3_1', 'tmp_2m_4_1', 'tmp_2m_5_1', 'tmp_sfc1_1', 'tmp_sfc2_1', 'tmp_sfc3_1', 'tmp_sfc4_1', 'tmp_sfc5_1','ulwrf_s1_1', 'ulwrf_s2_1', 'ulwrf_s3_1', 'ulwrf_s4_1', 'ulwrf_s5_1']

sns.pairplot(train_df[columns], height = 1 ,kind ='scatter',diag_kind='kde')
plt.show()
In [108]:
# 10
columns = ["ulwrf_t1_1", "ulwrf_t2_1", "ulwrf_t3_1"]

sns.pairplot(train_df[columns], height=1, kind="scatter", diag_kind="kde")
plt.show()
In [109]:
# 11
columns = ['ulwrf_t4_1', 'ulwrf_t5_1', ]

sns.pairplot(train_df[columns], height = 1 ,kind ='scatter',diag_kind='kde')
plt.show()
In [110]:
# 12
columns = ["uswrf_s2_1", "uswrf_s3_1", "uswrf_s4_1", "uswrf_s5_1"]

sns.pairplot(train_df[columns], height=1, kind="scatter", diag_kind="kde")
plt.show()

In the graphs above, we can observe that the most correlated variables exhibit a linear (and non linear) relationship between them and with the output. This is evident in the diagonal pattern in the graph, indicating that both variables increase or decrease together.

As we have previously mentioned, this is expected as the variables are all weather-related, such as radiative waves, rain, and clouds. It is normal for them to exhibit correlation at different times of the day within the same day, and it is important to consider this when creating the model and eliminating redundant variables as highly correlated variables provide redundant information and can negatively impact model performance. By identifying and removing redundant variables, the model becomes more focused, interpretable, and less prone to overfitting.


4. Train-Test division¶

Since we are working with a time dependent data, we need to avoid mixing it. Also, we are required to add the first 10 years of data to the train set and the last 2 years to the test set. This means we are assigning a 83.333333 percent of the data to train and a 16.66666666 to test.

Note: This division was already done before the EDA. We overwrite it to start from a clean state.

Note: iloc is useful when we want to split data based on the index or other criteria, while train_test_split is useful when wanting to randomly split data into training and testing subsets.
Therefore, we will use iloc to split the data into train and test sets as we are dealing with time dependent data.

In [111]:
import time
import matplotlib.pyplot as plt

# Import the metrics from sklearn
from sklearn.metrics import mean_squared_error, mean_absolute_error

from sklearn.pipeline import Pipeline

# As we have noted during the EDA, for this dataset full of outliers, its preferable to use the RobustScaler
# Although this wont make a huge difference
from sklearn.preprocessing import StandardScaler, RobustScaler

from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import TimeSeriesSplit, RandomizedSearchCV, GridSearchCV

4.1. Train-Test split¶

In [112]:
""" Train Test Split (time series) """

np.random.seed(10)

# * Make a copy of the dataframe (as Padas dataframe is mutable, therefore uses a reference)
disp_df_copy = disp_df.copy()

# print(disp_df)
# print(disp_df_copy)

# Now we make the train_x, train_y, test_x, test_y splits taking into account the time series
# Note: the time series is ordered by date, therefore we need to split the data in a way that the train data is before the test data
# Note: the 10 first years are used for training and the last two years for testing
# Note: this is done because if not, we will be predicting the past from the future, which leads to errors and overfitting (data leakage) in the model

# * Calculate the number of rows for training and testing
num_rows = disp_df_copy.shape[0]
num_train_rows = int(
    num_rows * 10 / 12
)  # 10 first years for training, 2 last years for testing

# * Split the data into train and test dataframes (using iloc instead of train_test_split as it picks random rows)
train_df = disp_df_copy.iloc[
    :num_train_rows, :
]  # train contains the first 10 years of rows
test_df = disp_df_copy.iloc[
    num_train_rows:, :
]  # test contains the last 2 years of rows

# Print the number of rows for each dataframe
print(f"Number of rows for training: {train_df.shape[0]}")
print(f"Number of rows for testing: {test_df.shape[0]}")

# Print the dataframes
# print(train_df), print(test_df)

# * Separate the input features and target variable for training and testing
X_train = train_df.drop("salida", axis=1)  # This is the input features for training
y_train = train_df["salida"]  # This is the target variable for training
X_test = test_df.drop("salida", axis=1)  # This is the input features for testing
y_test = test_df["salida"]  # This is the target variable for testing

# We also make a simulation of the exact 5th fold (4 for training and 1 for testing from the training data)
num_rows_train = train_df.shape[0]
num_train_rows_train = int(num_rows_train * 4 / 5)  # 4 folds for training, 1 fold for testing
train_5th_fold_train_df = train_df.iloc[
    :num_train_rows_train, :
]  # train_5th_fold_train contains the first 4 folds of rows

test_5th_fold_train_df = train_df.iloc[ 
    num_train_rows_train:, :
] # test_5th_fold_train contains the last fold of rows

# * Separate the input features and target variable for training and testing
X_train_5th_fold_train = train_5th_fold_train_df.drop("salida", axis=1)  # This is the input features for training
y_train_5th_fold_train = train_5th_fold_train_df["salida"]  # This is the target variable for training
X_test_5th_fold_train = test_5th_fold_train_df.drop("salida", axis=1)  # This is the input features for testing
y_test_5th_fold_train = test_5th_fold_train_df["salida"]  # This is the target variable for testing

print(f"Number of rows for training in the 5th fold: {train_5th_fold_train_df.shape[0]}")
print(f"Number of rows for testing in the 5th fold: {test_5th_fold_train_df.shape[0]}")

# Print the shapes of the dataframes
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape, X_train_5th_fold_train.shape, y_train_5th_fold_train.shape, X_test_5th_fold_train.shape, y_test_5th_fold_train.shape)
Number of rows for training: 3650
Number of rows for testing: 730
Number of rows for training in the 5th fold: 2920
Number of rows for testing in the 5th fold: 730
(3650, 75) (3650,) (730, 75) (730,) (2920, 75) (2920,) (730, 75) (730,)

4.2. Train-Test RMSE and MAE function¶

This function is used to get the MAE and RMSE values of the diffent models, therefore, it will also show the level of overfitting in the models. To perform this analysis, we compare the results of the training dataset from the first fold created by the time-series split, with the validation results of the same fold. Note that by using the training and validation sets, we avoid using the test set for any analysis, which is not recommended.

As with time split folds the largest one (more training data, as test is the same) is the fifth fold in our case, we will use it to compare the results of the different models and obtain the MAE and RMSE values. This way we can compare the results of the different models and see which one is the best. We could calculate and plot the MAE and RMSE values for each fold, but this would be time consuming and would not provide us with any additional information (we haves tested the different results with the different folds prior to this assumptions).

This way, we obtain a rather similar result in train and a relative aproximation of the test results, which is what we are looking for.

Note that we also add the posibility to make the predictions on test, but they will be only used in the final model, as we will see later.

In [113]:
np.random.seed(10)

def train_validation_test(m, model, score, X_train, y_train, test = False, X_test = None, y_test = None):
    
    # Train
    y_train_pred = model.predict(X_train)
    rmse_train = mean_squared_error(y_train, y_train_pred, squared=False)
    mae_train = mean_absolute_error(y_train, y_train_pred)

    # Test
    if test:
        y_test_pred = model.predict(X_test)
        rmse_test = mean_squared_error(y_test, y_test_pred, squared=False)
        mae_test = mean_absolute_error(y_test, y_test_pred)
    
    # We retrain the model with the partial training data (4 folds) and test it with the 5th fold
    np.random.seed(10)

    m.fit(X = X_train_5th_fold_train, y = y_train_5th_fold_train)
    
    # Train in validation fold (5)
    y_train_validation_pred = m.predict(X_train_5th_fold_train)
     
    rmse_train_validation = mean_squared_error(y_train_5th_fold_train, y_train_validation_pred, squared=False)
    mae_train_validation = mean_absolute_error(y_train_5th_fold_train, y_train_validation_pred)

    # Test in validation fold (5)
    y_test_validation_pred = m.predict(X_test_5th_fold_train)
    
    rmse_test_validation = mean_squared_error(y_test_5th_fold_train, y_test_validation_pred, squared=False)
    mae_test_validation = mean_absolute_error(y_test_5th_fold_train, y_test_validation_pred)
    
    # ! Print results
    print(f"Results of the best estimator of {model.__class__.__name__}")
    print(f"NMAE in validation: {score:.2f}")
    print(f"RMSE train: {rmse_train:.2f}", f"MAE train: {mae_train:.2f}", sep=" | ")
    if test:
        print(f"RMSE test: {rmse_test:.2f}", f"MAE test: {mae_test:.2f}", sep=" | ")
    print(f"RMSE validation train: {rmse_train_validation:.2f}", f"MAE validation train: {mae_train_validation:.2f}", sep=" | ")
    print(f"RMSE validation test: {rmse_test_validation:.2f}", f"MAE validation test: {mae_test_validation:.2f}", sep=" | ")

    # ! Train
    title = f'Prediction Errors (RMSE: {rmse_train:.2f}, MAE: {mae_train:.2f})'
    scatterplot_histogram(X_train, y_train, y_train_pred, "Train", title)

    # ! Test
    if test:
        title = f'Prediction Errors (RMSE: {rmse_test:.2f}, MAE: {mae_test:.2f})'
        scatterplot_histogram(X_test, y_test, y_test_pred, "Test", title)
    
    # ! Train in validation fold (5)
    title = f'Prediction Errors (RMSE: {rmse_train_validation:.2f}, MAE: {mae_train_validation:.2f})'
    scatterplot_histogram(X_train_5th_fold_train, y_train_5th_fold_train, y_train_validation_pred, "Train in validation", title)

    # ! Test in validation fold (5)
    title = f'Prediction Errors (RMSE: {rmse_test_validation:.2f}, MAE: {mae_test_validation:.2f})'
    scatterplot_histogram(X_test_5th_fold_train, y_test_5th_fold_train, y_test_validation_pred, "Test in validation", title)
    
    if test:
        return [score, rmse_train, mae_train, rmse_train_validation, mae_train_validation, rmse_test_validation, mae_test_validation, rmse_test, mae_test,]
    return [score, rmse_train, mae_train, rmse_train_validation, mae_train_validation, rmse_test_validation, mae_test_validation]

def scatterplot_histogram (X, y, y_pred, name, title):
    # Make plots smaller to fit better on the notebook
    plt.rcParams['figure.figsize'] = [5.5, 3.5]
    
    # Train accuracy using scatter plot
    plt.plot(X.iloc[:, [0]], y, ".", label=f"{name}")
    plt.plot(X.iloc[:, [0]], y_pred, "r.", label=f"{name} prediction")
    plt.title(f"{name} scatter plot")
    plt.xlabel(f"{title}")
    plt.legend()
    plt.show()
    
    # Calculate the difference between test predictions and test data
    prediction_errors = y - y_pred
    
    # Plot the distribution of prediction errors    
    plt.hist(prediction_errors, bins=25)
    plt.xlabel(f'{title}')
    plt.ylabel('Frequency')
    plt.title(f"{name} prediction errors")
    plt.show()

4.3. Print model results¶

In [114]:
def print_results(name, model, score, time, test=False):
    print("---------------------------------------------------")
    print(f"{name} best model is:\n\n{model}")
    print("\nParameters:", model.best_params_)

    print(
        f"\nPerformance: NMAE (val): {score[0]}",
        f"RMSE train: {score[1]}",
        f"MAE train: {score[2]}",
        f"RMSE train in validation: {score[3]}",
        f"MAE train in validation: {score[4]}",
        f"RMSE test in validation: {score[5]}",
        f"MAE test in validation: {score[6]}",
        sep=" | ",
    )
    
    if test:
        print(
            f"RMSE test: {score[7]}",
            f"MAE test: {score[8]}",
            sep=" | ",
        )
        

    print(f"Execution time: {time}s")

4.4. Validation splits¶

We calculate the subsets used for training and testing in the different folds of the cross-validation.

Note: this function will not be used as we already made the fifth fold partition manually above (which is faster and does not need to be recomputed). However, it is useful to have it in case we want to use it in the future (for other folds, as it stores them all).

In [115]:
def validation_splits(model, X_train):
    dict_folds = {}
    
    for n_splits, (train_index, test_index) in enumerate(model.cv.split(X_train)):
        index = "F" + str(n_splits + 1)
        train_index_formatted = []
        test_index_formatted = []

        for i in range(len(train_index)):
            train_index_formatted.append("V" + str(int(train_index[i] + 1)))

        for i in range(len(test_index)):
            test_index_formatted.append("V" + str(int(test_index[i] + 1)))

        dict_folds[index] = [train_index_formatted, test_index_formatted]
        
    return dict_folds

Decisions for all models¶

For each possible method we have created two different models; One with predefined parameters and the second one with selected parameters. For each model we create a pipeline which includes the escaler ( except for trees and related ) and the model. Note that we have selected RobustEscaler as our scaling method since we have found several outliers in the EDA. Secondly, we duplicate this two models per method and we add the selection of attributes. Note that the model with no selection of attributes and the one with selection of attributes have a double pipeline. Is a double pipeline since we use the output of the first pipeline ( best hiper-parameters ) directly into the second pipeline in order to avoid innecesary computing cost.

We have decided to train all models in the most similar way possible in order for the results to be comparable. This way, all models with selected parameters use RandomSearch in order to avoid unnecessary computational cost while still producing good results. Secondly, we have decided to use TimeSeriesSplit, which is a useful method when working with time-related data. We also perform a cross-validation within the parameter search in order to avoid optimistic scoring for some parameters. For all models, we are using a 5-fold cross-validation. We also decided to use NMAE as our method for testing error since it provides an easy-to-understand score and reduces the weight of outliers (as observed during the EDA process).

In addition note that in order to create the predefined models we are using gridsearch with just one option in the param-grid. This help us stay consistent in the way we create and compare models, since it provides a way of using cross-validation within the function.


5. Basic methods¶

During this section, we will analyze the performance of three methods: KNN, Regression Trees, and Linear Regression. For each method, we will provide a predefined model and another model with selected hyper-parameters. Our hypothesis is that the selected models will provide better performance, while the predefined ones will be better in terms of timing.

Please note that we will be using GridSearch with only one possibility (the predefined one) for the hyper-parameter to make it easier to create the predefined models. Additionally, we have decided to use RandomSearch for the selection of the parameters as it has been shown to provide good results with much less computing required.

In [116]:
# Three dictionaries to store the results of the models
models, results, times = {}, {}, {}

5.1. KNN¶

KNN (k-Nearest Neighbors) is a non-parametric algorithm used for classification and regression. It works by finding the k closest training examples in the feature space to a new input, and assigns the output value based on the majority class among the k neighbors in the case of classification or the average of the output values in the case of regression(our case). The value of k is a hyperparameter that must be chosen before training the model.

In [117]:
from sklearn.neighbors import KNeighborsRegressor

5.1.1. KNN - Predefined parameters¶

5.1.1.1. KNN - Predefined parameters - No attribute selection¶

In [118]:
np.random.seed(10)
n_splits = 5

pipeline = Pipeline(
    [
        ("scale", RobustScaler()),
        ("model", KNeighborsRegressor()),
    ]
)

param_grid = {
    "model__n_neighbors": [5],
    "model__weights": ["uniform"],
    "model__metric": ["minkowski"],
    "model__algorithm": ["auto"],
}

model = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=TimeSeriesSplit(n_splits),
    n_jobs=-1,
)

start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()

total_time = end_time - start_time

# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train)  # We al ready did the 5th fold split at the begginning

# We obtain the different scores of the model
score = train_validation_test(
    model,
    model.best_estimator_,
    model.best_score_,
    X_train,
    y_train,
)

models["KNN_pred"] = model
results["KNN_pred"] = score
times["KNN_pred"] = total_time

print_results("KNN PREDEFINED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline
NMAE in validation: -3239984.25
RMSE train: 3517654.38 | MAE train: 2493007.22
RMSE validation train: 3557480.48 | MAE validation train: 2518007.16
RMSE validation test: 4152140.06 | MAE validation test: 2892257.01
---------------------------------------------------
KNN PREDEFINED PARAMETERS best model is:

GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
             estimator=Pipeline(steps=[('scale', RobustScaler()),
                                       ('model', KNeighborsRegressor())]),
             n_jobs=-1,
             param_grid={'model__algorithm': ['auto'],
                         'model__metric': ['minkowski'],
                         'model__n_neighbors': [5],
                         'model__weights': ['uniform']},
             scoring='neg_mean_absolute_error')

Parameters: {'model__algorithm': 'auto', 'model__metric': 'minkowski', 'model__n_neighbors': 5, 'model__weights': 'uniform'}

Performance: NMAE (val): -3239984.25 | RMSE train: 3517654.379918169 | MAE train: 2493007.2164383563 | RMSE train in validation: 3557480.484807456 | MAE train in validation: 2518007.157534247 | RMSE test in validation: 4152140.058048495 | MAE test in validation: 2892257.01369863
Execution time: 4.260819673538208s

5.1.1.2. KNN - Predefined parameters - Attribute selection¶

In [119]:
np.random.seed(10)
n_splits = 5

# Using a pipeline to scale the data and then apply the model
pipeline = Pipeline(
    [
        ("scale", RobustScaler()),
        ("select", SelectKBest(f_regression)),
        ("model", KNeighborsRegressor()),
    ]
)

param_grid = {
    "model__n_neighbors": [5],
    "model__weights": ["uniform"],
    "model__metric": ["minkowski"],
    "model__algorithm": ["auto"],
    "select__k": list(range(1, X_train.shape[1] + 1)),
}

model = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=TimeSeriesSplit(n_splits),
    n_jobs=-1,
)

start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()

# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train)  # We al ready did the 5th fold split at the begginning

# We obtain the different scores of the model
score = train_validation_test(
    model,
    model.best_estimator_,
    model.best_score_,
    X_train,
    y_train,
)

models["KNN_pred_k"] = model
results["KNN_pred_k"] = score
times["KNN_pred_k"] = total_time

print_results("KNN PREDEFINED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline
NMAE in validation: -2690780.41
RMSE train: 3108869.52 | MAE train: 2162755.27
RMSE validation train: 3116226.54 | MAE validation train: 2171515.71
RMSE validation test: 3775814.09 | MAE validation test: 2560118.55
---------------------------------------------------
KNN PREDEFINED PARAMETERS best model is:

GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
             estimator=Pipeline(steps=[('scale', RobustScaler()),
                                       ('select',
                                        SelectKBest(score_func=<function f_regression at 0x7f817b0e3f40>)),
                                       ('model', KNeighborsRegressor())]),
             n_jobs=-1,
             param_grid={'model__algorithm': ['auto'],
                         'model__metric': ['minkowski'],
                         'model__n_neighbors': [5],
                         'model__weights': ['uniform'],
                         'select__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                                       13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
                                       23, 24, 25, 26, 27, 28, 29, 30, ...]},
             scoring='neg_mean_absolute_error')

Parameters: {'model__algorithm': 'auto', 'model__metric': 'minkowski', 'model__n_neighbors': 5, 'model__weights': 'uniform', 'select__k': 6}

Performance: NMAE (val): -2690780.4078947366 | RMSE train: 3108869.5243311627 | MAE train: 2162755.2657534247 | RMSE train in validation: 3116226.5417679944 | MAE train in validation: 2171515.705479452 | RMSE test in validation: 3775814.085873258 | MAE test in validation: 2560118.5479452056
Execution time: 4.260819673538208s

Since the NMAE is normalized by the mean absolute error of the test set, it is expected to be different from the MAE calculated directly using the mean_absolute_error function. The NMAE is a way to evaluate the performance of a model in a cross-validation setting, while the MAE is a direct measure of the model's performance on the training set.

Therefore, as we can not use the results of RMSE nor MAE in test, we will use the NMAE scoring given in validation to select the best model (as it is a fairly correct aproximation).

5.1.2. KNN - Selected parameters¶

As seen during the EDA, we have a lot of outliers in the dataset, so we will use a Robust Scaler to scale the data, as it is more robust to outliers than the Standard Scaler or the MinMax Scaler.

In order to make the process of comparing the Selected parameters with the Predefined parameters, we will create two different models, one for each set of parameters, created one from another with the best parameters found in the previous step and a pipeline with the preprocessing steps.

Note that KNN with the parameters it selects tend to overfit to the data in this dataset, as it can be seen in the different scatterplots and results. Moreover, we can make sure that it is kind of overfitting (apart from the fact that it has an score of 0 for both MAE and RMSE in train and train validation) as the result in the validation (5th fold) test is not near that good.

5.1.2.1. KNN - Selected parameters - No attribute selection¶

For this model, as explained in the introduction of this section, the main parameter to consider is the number of neighbors. Additionally, we have identified other relevant parameters that need to be chosen:

  • metric: KNN is a distance-based model, so the way we measure distance between data points affects the results.
  • Weights (n_neighbors): We can decide whether to give equal importance to all neighbors, or if the closest neighbors should have a greater impact on the result (number of neighbors used to determine the output).
  • algorithm: Different algorithms can be used to compute the nearest neighbors, such as a brute-force approach that compares all data points, or a tree-based approach that partitions the data space. The latter is typically faster, but may not always provide the best results.
In [120]:
rmse = []
mae = []
rmse2 = []
mae2 = []

a_n_neighbors = range(1, 50, 2)
a_metric = ["euclidean", "manhattan", "minkowski", "chebyshev"]

for i in a_n_neighbors: 
    model = KNeighborsRegressor(n_neighbors=i)
    model.fit(X_train_5th_fold_train , y_train_5th_fold_train )
    y_pred = model.predict(X_test_5th_fold_train)
    rmse.append(np.sqrt(mean_squared_error(y_test_5th_fold_train , y_pred)))
    mae.append(mean_absolute_error(y_test_5th_fold_train, y_pred))
    
for i in a_metric:
    model = KNeighborsRegressor(metric=i)
    model.fit(X_train_5th_fold_train , y_train_5th_fold_train )
    y_pred = model.predict(X_test_5th_fold_train)
    rmse2.append(np.sqrt(mean_squared_error(y_test_5th_fold_train , y_pred)))
    mae2.append(mean_absolute_error(y_test_5th_fold_train, y_pred))

# Crear dos subplots, uno para cada gráfico
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(8, 12))

# Graficar RMSE vs. n_neighbors en el primer subplot
ax1.plot(list(a_n_neighbors), rmse, label="RMSE")
ax1.set_xlabel("n_neighbors")
ax1.set_ylabel("RMSE")
ax1.set_title("Gráfica de RMSE")

# Graficar MAE vs. n_neighbors en el segundo subplot
ax2.plot(list(a_n_neighbors), mae, label="MAE")
ax2.set_xlabel("n_neighbors")
ax2.set_ylabel("MAE")
ax2.set_title("Gráfica de MAE")

# Graficar RMSE vs. metric en el tercer subplot
ax3.plot(a_metric, rmse2, label="RMSE")
ax3.set_xlabel("metric")
ax3.set_ylabel("RMSE")
ax3.set_title("Gráfica de RMSE")

# Graficar MAE vs. metric en el cuarto subplot
ax4.plot(a_metric, mae2, label="MAE")
ax4.set_xlabel("metric")
ax4.set_ylabel("MAE")
ax4.set_title("Gráfica de MAE")

plt.tight_layout()
plt.rcParams['figure.figsize'] = [10, 3]
plt.show()
In [121]:
np.random.seed(10)
budget = 75
n_splits = 5

pipeline = Pipeline(
    [
        ("scaler", RobustScaler()),
        ("model", KNeighborsRegressor()),
    ]
)

param_grid = {
    "model__n_neighbors": list(range(1, 50, 2)),
    "model__weights": ["uniform", "distance"],
    "model__metric": ["euclidean", "manhattan", "minkowski", "chebyshev"],
    "model__algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
}

model = RandomizedSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=TimeSeriesSplit(
        n_splits
    ),  # TimeSeriesSplit to split the data in folds without losing the temporal order
    n_iter=budget,
    n_jobs=-1,
)

start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()

total_time = end_time - start_time

# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train)  # We al ready did the 5th fold split at the begginning

# We obtain the different scores of the model
score = train_validation_test(
    model,
    model.best_estimator_,
    model.best_score_,
    X_train,
    y_train,
)

models["KNN_select"] = model
results["KNN_select"] = score
times["KNN_select"] = total_time

print_results("KNN SELECTED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline
NMAE in validation: -2880131.56
RMSE train: 0.00 | MAE train: 0.00
RMSE validation train: 0.00 | MAE validation train: 0.00
RMSE validation test: 3732609.98 | MAE validation test: 2587777.13
---------------------------------------------------
KNN SELECTED PARAMETERS best model is:

RandomizedSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
                   estimator=Pipeline(steps=[('scaler', RobustScaler()),
                                             ('model', KNeighborsRegressor())]),
                   n_iter=75, n_jobs=-1,
                   param_distributions={'model__algorithm': ['auto',
                                                             'ball_tree',
                                                             'kd_tree',
                                                             'brute'],
                                        'model__metric': ['euclidean',
                                                          'manhattan',
                                                          'minkowski',
                                                          'chebyshev'],
                                        'model__n_neighbors': [1, 3, 5, 7, 9,
                                                               11, 13, 15, 17,
                                                               19, 21, 23, 25,
                                                               27, 29, 31, 33,
                                                               35, 37, 39, 41,
                                                               43, 45, 47, 49],
                                        'model__weights': ['uniform',
                                                           'distance']},
                   scoring='neg_mean_absolute_error')

Parameters: {'model__weights': 'distance', 'model__n_neighbors': 17, 'model__metric': 'manhattan', 'model__algorithm': 'kd_tree'}

Performance: NMAE (val): -2880131.5631625694 | RMSE train: 0.0 | MAE train: 0.0 | RMSE train in validation: 0.0 | MAE train in validation: 0.0 | RMSE test in validation: 3732609.9812009404 | MAE test in validation: 2587777.1287017944
Execution time: 6.565516233444214s

5.1.2.2. KNN - Selected parameters - Attribute selection¶

In [122]:
# Now, we will use the previously calculated best model to add the selection of attributes through the SelectKBest function in the pipeline
np.random.seed(10)
n_splits = 5


pipeline = Pipeline(
    [
        ("scaler", RobustScaler()),
        ("select", SelectKBest(f_regression)),
        ("model", KNeighborsRegressor()),
    ]
)

# Previous best model had as parameters: {'model__weights': 'distance', 'model__n_neighbors': 9, 'model__metric': 'manhattan'}
 
param_grid = {
    "model__n_neighbors": [9],
    "model__weights": ["distance"],
    "model__metric": ["manhattan"],
    "model__algorithm": ["kd_tree"],
    "select__k": list(range(1, X_train.shape[1] + 1)),
}

model = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=TimeSeriesSplit(
        n_splits
    ),
    n_jobs=-1,
)

start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()

total_time = end_time - start_time

# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train)  # We al ready did the 5th fold split at the begginning

# We obtain the different scores of the model
score = train_validation_test(
    model,
    model.best_estimator_,
    model.best_score_,
    X_train,
    y_train,
)

models["KNN_select_k"] = model
results["KNN_select_k"] = score
times["KNN_select_k"] = total_time

print_results("KNN SELECTED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline
NMAE in validation: -2603870.87
RMSE train: 1355.34 | MAE train: 31.73
RMSE validation train: 25827.04 | MAE validation train: 675.92
RMSE validation test: 3681057.75 | MAE validation test: 2483096.38
---------------------------------------------------
KNN SELECTED PARAMETERS best model is:

GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
             estimator=Pipeline(steps=[('scaler', RobustScaler()),
                                       ('select',
                                        SelectKBest(score_func=<function f_regression at 0x7f817b0e3f40>)),
                                       ('model', KNeighborsRegressor())]),
             n_jobs=-1,
             param_grid={'model__algorithm': ['kd_tree'],
                         'model__metric': ['manhattan'],
                         'model__n_neighbors': [9],
                         'model__weights': ['distance'],
                         'select__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                                       13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
                                       23, 24, 25, 26, 27, 28, 29, 30, ...]},
             scoring='neg_mean_absolute_error')

Parameters: {'model__algorithm': 'kd_tree', 'model__metric': 'manhattan', 'model__n_neighbors': 9, 'model__weights': 'distance', 'select__k': 6}

Performance: NMAE (val): -2603870.865432223 | RMSE train: 1355.336484531192 | MAE train: 31.726027397260275 | RMSE train in validation: 25827.044900407618 | MAE train in validation: 675.9246575342465 | RMSE test in validation: 3681057.75211333 | MAE test in validation: 2483096.382277287
Execution time: 4.645997762680054s

5.2 Regression Trees¶

Trees work by recursively partitioning the data into subsets based on the values of their features, creating a tree-like structure that maps each set of features to a predicted target value. Each node in the tree represents a feature, and each branch represents a decision rule based on the value of that feature. The goal is to split the data in a way that creates the most homogeneous subsets with respect to the target variable. Once the tree is constructed, it can be used to make predictions on new data by following the decision rules down the tree until a leaf node is reached, which contains the predicted target value.

Note: In trees (both regression trees and random forests), it is not necessary to scale the data, as the algorithm is not sensitive to the scale of the data.

In [123]:
from sklearn.tree import DecisionTreeRegressor

5.2.1. Regression Trees - Predefined parameters¶

5.2.1.1. Regression Trees - Predefined parameters - No attribute selection¶

In [124]:
np.random.seed(10)
n_splits = 5

pipeline = Pipeline(
    [
        ("model", DecisionTreeRegressor(random_state=1)),
    ]
)

param_grid = {
    "model__criterion": ["squared_error"],
    "model__max_depth": [None],
    "model__min_samples_split": [2],
    "model__max_features": [None],
}

model = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=TimeSeriesSplit(n_splits),
    n_jobs=-1,
)


start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()

total_time = end_time - start_time

# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train)  # We al ready did the 5th fold split at the begginning

# We obtain the different scores of the model
score = train_validation_test(
    model,
    model.best_estimator_,
    model.best_score_,
    X_train,
    y_train,
)

models["RegTrees_pred"] = model
results["RegTrees_pred"] = score
times["RegTrees_pred"] = total_time

print_results("REGRESSION TREES PREDEFINED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline
NMAE in validation: -3467149.44
RMSE train: 0.00 | MAE train: 0.00
RMSE validation train: 0.00 | MAE validation train: 0.00
RMSE validation test: 4961507.79 | MAE validation test: 3406755.21
---------------------------------------------------
REGRESSION TREES PREDEFINED PARAMETERS best model is:

GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
             estimator=Pipeline(steps=[('model',
                                        DecisionTreeRegressor(random_state=1))]),
             n_jobs=-1,
             param_grid={'model__criterion': ['squared_error'],
                         'model__max_depth': [None],
                         'model__max_features': [None],
                         'model__min_samples_split': [2]},
             scoring='neg_mean_absolute_error')

Parameters: {'model__criterion': 'squared_error', 'model__max_depth': None, 'model__max_features': None, 'model__min_samples_split': 2}

Performance: NMAE (val): -3467149.4407894737 | RMSE train: 0.0 | MAE train: 0.0 | RMSE train in validation: 0.0 | MAE train in validation: 0.0 | RMSE test in validation: 4961507.791413844 | MAE test in validation: 3406755.205479452
Execution time: 0.5705435276031494s

5.2.1.2. Regression Trees - Predefined parameters - Attribute selection¶

In [125]:
np.random.seed(10)
n_splits = 5

pipeline = Pipeline(
    [
        ('select', SelectKBest(f_regression)),
        ("model", DecisionTreeRegressor(random_state=1)),
    ]
)

param_grid = {
    "model__criterion": ["squared_error"],
    "model__max_depth": [None],
    "model__min_samples_split": [2],
    "model__max_features": [None],
    "select__k": list(range(1, X_train.shape[1] + 1)),
}

model = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=TimeSeriesSplit(n_splits),
    n_jobs=-1,
)


start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()

total_time = end_time - start_time

# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train)  # We al ready did the 5th fold split at the begginning

# We obtain the different scores of the model
score = train_validation_test(
    model,
    model.best_estimator_,
    model.best_score_,
    X_train,
    y_train,
)

models["RegTrees_pred_k"] = model
results["RegTrees_pred_k"] = score
times["RegTrees_pred_k"] = total_time

print_results("REGRESSION TREES PREDEFINED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline
NMAE in validation: -3328832.17
RMSE train: 0.00 | MAE train: 0.00
RMSE validation train: 0.00 | MAE validation train: 0.00
RMSE validation test: 5002502.82 | MAE validation test: 3460965.62
---------------------------------------------------
REGRESSION TREES PREDEFINED PARAMETERS best model is:

GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
             estimator=Pipeline(steps=[('select',
                                        SelectKBest(score_func=<function f_regression at 0x7f817b0e3f40>)),
                                       ('model',
                                        DecisionTreeRegressor(random_state=1))]),
             n_jobs=-1,
             param_grid={'model__criterion': ['squared_error'],
                         'model__max_depth': [None],
                         'model__max_features': [None],
                         'model__min_samples_split': [2],
                         'select__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                                       13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
                                       23, 24, 25, 26, 27, 28, 29, 30, ...]},
             scoring='neg_mean_absolute_error')

Parameters: {'model__criterion': 'squared_error', 'model__max_depth': None, 'model__max_features': None, 'model__min_samples_split': 2, 'select__k': 9}

Performance: NMAE (val): -3328832.171052632 | RMSE train: 0.0 | MAE train: 0.0 | RMSE train in validation: 0.0 | MAE train in validation: 0.0 | RMSE test in validation: 5002502.819275869 | MAE test in validation: 3460965.616438356
Execution time: 3.333441972732544s

Note: As we can see, the default model is clearly overfitting, as indicated by the 0 error for the train section and a high error for the test section. This is likely due to the lack of control over the maximum depth of the tree, combined with a small minimum sample split that leaves only one sample in each leaf. This causes the model to memorize each data point, leading to poor generalization performance.

5.2.2. Regression Trees - Selected parameters¶

Building upon the previous definition, we can reduce the most important parameters to be ajusted to the following:

  • max_features: controls the number of features ( or attributes ) used in the tree.
  • min_samples-split: controls the minumum number of instances a leaf must have in order to be able to subdivide. This parameter can prevent the tree from overfitting.
  • max-depth: This parammeter also helps to prevent overfitting by stoping the tree from subdividing to much.
In [126]:
rmse = []
mae = []
rmse2 = []
mae2 = []

a_max_depth = range(5, 61, 5)
a_min_samples_split = range(5, 200)

for i in a_max_depth: 
    model = DecisionTreeRegressor(random_state=1, max_depth=i)
    model.fit(X_train_5th_fold_train , y_train_5th_fold_train )
    y_pred = model.predict(X_test_5th_fold_train)
    rmse.append(np.sqrt(mean_squared_error(y_test_5th_fold_train , y_pred)))
    mae.append(mean_absolute_error(y_test_5th_fold_train, y_pred))
    
for i in a_min_samples_split:
    model = DecisionTreeRegressor(random_state=1, min_samples_split=i)
    model.fit(X_train_5th_fold_train , y_train_5th_fold_train )
    y_pred = model.predict(X_test_5th_fold_train)
    rmse2.append(np.sqrt(mean_squared_error(y_test_5th_fold_train , y_pred)))
    mae2.append(mean_absolute_error(y_test_5th_fold_train, y_pred))

# Crear dos subplots, uno para cada gráfico
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(8, 12))

# Graficar RMSE vs. n_neighbors en el primer subplot
ax1.plot(list(a_max_depth), rmse, label="RMSE")
ax1.set_xlabel("max_depth")
ax1.set_ylabel("RMSE")
ax1.set_title("Gráfica de RMSE")

# Graficar MAE vs. n_neighbors en el segundo subplot
ax2.plot(list(a_max_depth), mae, label="MAE")
ax2.set_xlabel("max_depth")
ax2.set_ylabel("MAE")
ax2.set_title("Gráfica de MAE")

# Graficar RMSE vs. metric en el tercer subplot
ax3.plot(list(a_min_samples_split), rmse2, label="RMSE")
ax3.set_xlabel("min_samples_split")
ax3.set_ylabel("RMSE")
ax3.set_title("Gráfica de RMSE")

# Graficar MAE vs. metric en el cuarto subplot
ax4.plot(list(a_min_samples_split), mae2, label="MAE")
ax4.set_xlabel("min_samples_split")
ax4.set_ylabel("MAE")
ax4.set_title("Gráfica de MAE")

plt.tight_layout()
plt.rcParams['figure.figsize'] = [10, 3]
plt.show()

5.2.2.1. Regression Trees - Selected parameters - No attribute selection¶

In [127]:
np.random.seed(10)
budget = 75
n_splits = 5

pipeline = Pipeline(
    [
        ("model", DecisionTreeRegressor(random_state=1))
    ]
)

param_grid = {
    "model__criterion": ["absolute_error", "squared_error"],
    "model__max_depth": list(range(5, 61, 5)),
    "model__min_samples_split": list(range(5, 200)),
    "model__max_features": ["sqrt", "log2", None],
}

# We use TimeSeriesSplit to split the data in folds without losing the temporal order
model = RandomizedSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=TimeSeriesSplit(n_splits),
    n_iter=budget,
    n_jobs=-1,
)

start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()

total_time = end_time - start_time

# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train)  # We al ready did the 5th fold split at the begginning

# We obtain the different scores of the model
score = train_validation_test(
    model,
    model.best_estimator_,
    model.best_score_,
    X_train,
    y_train,
)

models["RegTrees_select"] = model
results["RegTrees_select"] = score
times["RegTrees_select"] = total_time

print_results("REGRESSION TREES SELECTED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline
NMAE in validation: -2743220.58
RMSE train: 3259190.45 | MAE train: 2080612.60
RMSE validation train: 3286556.31 | MAE validation train: 2092567.19
RMSE validation test: 3914582.59 | MAE validation test: 2655352.60
---------------------------------------------------
REGRESSION TREES SELECTED PARAMETERS best model is:

RandomizedSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
                   estimator=Pipeline(steps=[('model',
                                              DecisionTreeRegressor(random_state=1))]),
                   n_iter=75, n_jobs=-1,
                   param_distributions={'model__criterion': ['absolute_error',
                                                             'squared_error'],
                                        'model__max_depth': [5, 10, 15, 20, 25,
                                                             30, 35, 40, 45, 50,
                                                             55, 60],
                                        'model__max_features': ['sqrt', 'log2',
                                                                None],
                                        'model__min_samples_split': [5, 6, 7, 8,
                                                                     9, 10, 11,
                                                                     12, 13, 14,
                                                                     15, 16, 17,
                                                                     18, 19, 20,
                                                                     21, 22, 23,
                                                                     24, 25, 26,
                                                                     27, 28, 29,
                                                                     30, 31, 32,
                                                                     33, 34, ...]},
                   scoring='neg_mean_absolute_error')

Parameters: {'model__min_samples_split': 106, 'model__max_features': None, 'model__max_depth': 30, 'model__criterion': 'absolute_error'}

Performance: NMAE (val): -2743220.575657895 | RMSE train: 3259190.446254432 | MAE train: 2080612.602739726 | RMSE train in validation: 3286556.310045412 | MAE train in validation: 2092567.191780822 | RMSE test in validation: 3914582.5939823505 | MAE test in validation: 2655352.602739726
Execution time: 16.35970973968506s

5.2.2.2. Regression Trees - Selected parameters - Attribute selection¶

In [128]:
np.random.seed(10)
n_splits = 5

pipeline = Pipeline(
    [
        ("select", SelectKBest(f_regression)),
        ("model", DecisionTreeRegressor(random_state=1))
    ]
)

# Previous model Parameters: {'model__min_samples_split': 106, 'model__max_features': None, 'model__max_depth': 30, 'model__criterion': 'absolute_error'}

param_grid = {
    "model__criterion": ["absolute_error"],
    "model__max_depth": [30],
    "model__min_samples_split": [106],
    "model__max_features": [None],
    "select__k": list(range(1, X_train.shape[1] + 1)),
}

# We use TimeSeriesSplit to split the data in folds without losing the temporal order
model = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=TimeSeriesSplit(n_splits),
    n_jobs=-1,
)

start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()

total_time = end_time - start_time

# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train)  # We al ready did the 5th fold split at the begginning

# We obtain the different scores of the model
score = train_validation_test(
    model,
    model.best_estimator_,
    model.best_score_,
    X_train,
    y_train,
)

models["RegTrees_select_k"] = model
results["RegTrees_select_k"] = score
times["RegTrees_select_k"] = total_time

print_results("REGRESSION TREES SELECTED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline
NMAE in validation: -2727416.15
RMSE train: 3452866.62 | MAE train: 2199234.33
RMSE validation train: 3561457.96 | MAE validation train: 2280089.28
RMSE validation test: 4044668.04 | MAE validation test: 2710957.60
---------------------------------------------------
REGRESSION TREES SELECTED PARAMETERS best model is:

GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
             estimator=Pipeline(steps=[('select',
                                        SelectKBest(score_func=<function f_regression at 0x7f817b0e3f40>)),
                                       ('model',
                                        DecisionTreeRegressor(random_state=1))]),
             n_jobs=-1,
             param_grid={'model__criterion': ['absolute_error'],
                         'model__max_depth': [30],
                         'model__max_features': [None],
                         'model__min_samples_split': [106],
                         'select__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                                       13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
                                       23, 24, 25, 26, 27, 28, 29, 30, ...]},
             scoring='neg_mean_absolute_error')

Parameters: {'model__criterion': 'absolute_error', 'model__max_depth': 30, 'model__max_features': None, 'model__min_samples_split': 106, 'select__k': 4}

Performance: NMAE (val): -2727416.151315789 | RMSE train: 3452866.617242818 | MAE train: 2199234.328767123 | RMSE train in validation: 3561457.960699349 | MAE train in validation: 2280089.2808219176 | RMSE test in validation: 4044668.035092536 | MAE test in validation: 2710957.602739726
Execution time: 27.90829086303711s

5.3 Linnear Regression¶

Linear regression is a supervised learning algorithm that models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data. The goal is to find the best fit line that can predict the dependent variable given the independent variables.

For the selected model we will be considering Lasso and Ridge. Lasso and Ridge regression are two popular regularization techniques used with linear regression. Lasso adds a penalty term to the regression equation that encourages the model to minimize the absolute value of the regression coefficients, which can lead to some coefficients being exactly zero. Ridge regression, on the other hand, adds a penalty term that encourages the model to minimize the square of the regression coefficients, which can help prevent overfitting. These techniques can improve the performance of the linear regression model by reducing the impact of irrelevant or highly correlated features.

In [129]:
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet

5.3.1. Linear Regression - Predefined parameters¶

5.3.1.1. Linear Regression - Predefined parameters - No attribute selection¶

In [130]:
np.random.seed(10)
n_splits = 5

pipeline = Pipeline([("scaler", RobustScaler()), ("model", LinearRegression())])

param_grid = {
    "model__fit_intercept": [True],
}

model = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=TimeSeriesSplit(n_splits),
    n_jobs=-1,
)

start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()

total_time = end_time - start_time

# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train)  # We al ready did the 5th fold split at the begginning

# We obtain the different scores of the model
score = train_validation_test(
    model,
    model.best_estimator_,
    model.best_score_,
    X_train,
    y_train,
)

models["LinearReg_pred"] = model
results["LinearReg_pred"] = score
times["LinearReg_pred"] = total_time

print_results("LINEAR REGRESSION PREDEFINED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline
NMAE in validation: -2437056.06
RMSE train: 3254352.60 | MAE train: 2321647.06
RMSE validation train: 3265297.88 | MAE validation train: 2322380.61
RMSE validation test: 3268115.48 | MAE validation test: 2265683.80
---------------------------------------------------
LINEAR REGRESSION PREDEFINED PARAMETERS best model is:

GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
             estimator=Pipeline(steps=[('scaler', RobustScaler()),
                                       ('model', LinearRegression())]),
             n_jobs=-1, param_grid={'model__fit_intercept': [True]},
             scoring='neg_mean_absolute_error')

Parameters: {'model__fit_intercept': True}

Performance: NMAE (val): -2437056.0592061607 | RMSE train: 3254352.603690468 | MAE train: 2321647.0597032406 | RMSE train in validation: 3265297.879240584 | MAE train in validation: 2322380.6106294743 | RMSE test in validation: 3268115.4760430153 | MAE test in validation: 2265683.802964292
Execution time: 0.2956578731536865s

5.3.1.2. Linear Regression - Predefined parameters - No attribute selection¶

In [131]:
np.random.seed(10)
n_splits = 5

pipeline = Pipeline(
    [
        ("scaler", RobustScaler()),
        ("select", SelectKBest(f_regression)),
        ("model", LinearRegression()),
    ]
)

param_grid = {
    "model__fit_intercept": [True],
    "select__k": list(range(1, X_train.shape[1] + 1)),
}

model = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=TimeSeriesSplit(n_splits),
    n_jobs=-1,
)


start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()

total_time = end_time - start_time

# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train)  # We al ready did the 5th fold split at the begginning

# We obtain the different scores of the model
score = train_validation_test(
    model,
    model.best_estimator_,
    model.best_score_,
    X_train,
    y_train,
)

models["LinearReg_pred_k"] = model
results["LinearReg_pred_k"] = score
times["LinearReg_pred_k"] = total_time

print_results("LINEAR REGRESSION PREDEFINED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline
NMAE in validation: -2421796.65
RMSE train: 3256574.00 | MAE train: 2323171.61
RMSE validation train: 3267629.55 | MAE validation train: 2322601.75
RMSE validation test: 3267567.88 | MAE validation test: 2263068.40
---------------------------------------------------
LINEAR REGRESSION PREDEFINED PARAMETERS best model is:

GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
             estimator=Pipeline(steps=[('scaler', RobustScaler()),
                                       ('select',
                                        SelectKBest(score_func=<function f_regression at 0x7f817b0e3f40>)),
                                       ('model', LinearRegression())]),
             n_jobs=-1,
             param_grid={'model__fit_intercept': [True],
                         'select__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                                       13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
                                       23, 24, 25, 26, 27, 28, 29, 30, ...]},
             scoring='neg_mean_absolute_error')

Parameters: {'model__fit_intercept': True, 'select__k': 72}

Performance: NMAE (val): -2421796.652193799 | RMSE train: 3256573.9989301027 | MAE train: 2323171.6092511206 | RMSE train in validation: 3267629.5529683903 | MAE train in validation: 2322601.753096195 | RMSE test in validation: 3267567.87998712 | MAE test in validation: 2263068.4012916926
Execution time: 2.2786672115325928s

5.3.2. Linear Regression - Selected parameters¶

Expanding upon the previous discussion, when using Lasso regression, we can focus on adjusting the following key parameters:

  • alpha: This parameter determines the amount of regularization applied to the model. A higher alpha results in stronger regularization, which can help to reduce overfitting.
  • k: This parameter determines the number of features selected by the Lasso model. By adjusting k, we can control the complexity of the model and potentially improve its performance.

It's worth noting that these are just a few of the many parameters that can be adjusted when using Lasso regression. However, by focusing on these key parameters, we can gain a better understanding of how the model works and how to optimize its performance.

We can reduce the most important parameters to be adjusted for Ridge regression to the following:

  • alpha: controls the strength of the regularization penalty applied to the coefficients. A high alpha value can lead to underfitting, while a low alpha value can lead to overfitting.
  • k: the number of top features selected by the SelectKBest function. This parameter determines the number of features to be used in the model and can have an impact on its performance.

Similarly, we can reduce the most important parameters for Elastic Net regression to be adjusted to the following:

  • alpha: controls the regularization strength of both L1 and L2 penalties. A high alpha will increase the regularization strength, while a low alpha will decrease it.
  • l1_ratio: controls the ratio between L1 and L2 penalties. A l1_ratio of 1 is equivalent to Lasso regression, while a ratio of 0 is equivalent to Ridge regression.
  • k: the number of features to be selected by the SelectKBest method. This parameter is part of the pipeline and helps to select the most relevant features for the model.

Adjusting these parameters can help prevent overfitting and improve the performance of the Elastic Net regression model.

Note: due to scikit learn internal implementation of the Elastic Net model, we can obtain Lasso model by setting l1_ratio to 1 but not Ridge model by setting l1_ratio to 0, which is strange. This results in a poor performance of Elastic Net model, as for this dataset, Ridge model is way more efficient than Lasso model.
This could be occurring due to the dataset itself (outliers, correlations...), the dataset handling, the library version, the dependencies, the python version, or the virtual environment.

In [132]:
rmse = []
mae = []
rmse2 = []
mae2 = []

a_alpha = np.logspace(-2, 5, 75)

for i in a_alpha: 
    model = Lasso(fit_intercept=True, tol=0.5, random_state=10, alpha = i)
    model.fit(X_train_5th_fold_train , y_train_5th_fold_train )
    y_pred = model.predict(X_test_5th_fold_train)
    rmse.append(np.sqrt(mean_squared_error(y_test_5th_fold_train , y_pred)))
    mae.append(mean_absolute_error(y_test_5th_fold_train, y_pred))
    

# Crear dos subplots, uno para cada gráfico
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 12))

# Graficar RMSE vs. n_neighbors en el primer subplot
ax1.plot(list(a_alpha), rmse, label="RMSE")
ax1.set_xlabel("alpha")
ax1.set_ylabel("RMSE")
ax1.set_title("Gráfica de RMSE")

# Graficar MAE vs. n_neighbors en el segundo subplot
ax2.plot(list(a_alpha), mae, label="MAE")
ax2.set_xlabel("alpha")
ax2.set_ylabel("MAE")
ax2.set_title("Gráfica de MAE")

plt.tight_layout()
plt.rcParams['figure.figsize'] = [10, 3]
plt.show()

5.3.2.1. Linear Regression - Selected parameters - Attribute Selection¶

In [133]:
np.random.seed(10)
budget = 75
n_splits = 5

all_scores = []

# ! Pipelines
pipeline_lasso = Pipeline(
    [
        ("scaler", RobustScaler()),
        ("model", Lasso(fit_intercept=True, tol=0.5, random_state=10)),
    ]
)

pipeline_ridge = Pipeline(
    [
        ("scaler", RobustScaler()),
        ("model", Ridge(fit_intercept=True, random_state=10)),
    ]
)

pipeline_elastic = Pipeline(
    [
        ("scaler", RobustScaler()),
        ("model", ElasticNet(fit_intercept=True, tol=0.5, random_state=10)),
    ]
)

# ! Parameter grids
param_grid_lasso = {
    "model__alpha": np.logspace(-2, 5, 75),  # Between 0.001 and 100000
}

param_grid_ridge = {
    "model__alpha": np.logspace(-2, 1, 75),  # Between 0.001 and 10
}

param_grid_elastic = {
    "model__alpha": np.logspace(-2, 5, 75),  # Between 0.001 and 10
    "model__l1_ratio": np.linspace(0, 1, 75),  # Between 0 and 1
}

# ! If we want to use random values for the parameters -> unconsistency in the results
regr_lasso = RandomizedSearchCV(
    pipeline_lasso,
    param_grid_lasso,
    scoring="neg_mean_absolute_error",
    cv=TimeSeriesSplit(),
    n_iter=budget,
    n_jobs=-1,
)

regr_ridge = RandomizedSearchCV(
    pipeline_ridge,
    param_grid_ridge,
    scoring="neg_mean_absolute_error",
    cv=TimeSeriesSplit(),
    n_iter=budget,
    n_jobs=-1,
)

regr_elastic = RandomizedSearchCV(
    pipeline_elastic,
    param_grid_elastic,
    scoring="neg_mean_absolute_error",
    cv=TimeSeriesSplit(),
    n_iter=budget,
    n_jobs=-1,
)

model = [regr_lasso, regr_ridge, regr_elastic]


ln_reg_time, scoring = [], []


for i in model:
    start_time = time.time()
    i.fit(X=X_train, y=y_train)
    print(f"Model: {i.best_score_}")
    print(i.best_params_)
    # Now we reevaluate the model on the test set to obtain more accurate results
    # Calculate the subsets used for training and testing in the different folds of the cross-validation
    # validation_folds = validation_splits(i, X_train)
    scoring.append(i.best_score_)
    all_scores.append(
        train_validation_test(
            i,
            i.best_estimator_,
            i.best_score_,
            X_train,
            y_train,
        )
    )
    ln_reg_time.append(time.time() - start_time)

print(ln_reg_time)

# Select the best model (based on the MAE)
max_score = min(
    all_scores, key=lambda x: abs(x[0])
)  # Best model is the one that minimizes the validation NMAE
best_model = model[all_scores.index(max_score)]
total_time = ln_reg_time[all_scores.index(max_score)]

models["LinearReg_select"] = best_model
results["LinearReg_select"] = max_score
times["LinearReg_select"] = total_time

# Print results
print_results("LINEAR REGRESSION SELECTED PARAMETERS", best_model, score, total_time)
Model: -2916665.224197623
{'model__alpha': 52025.49442372698}
Results of the best estimator of Pipeline
NMAE in validation: -2916665.22
RMSE train: 3876831.69 | MAE train: 2906173.48
RMSE validation train: 3922101.26 | MAE validation train: 2937813.23
RMSE validation test: 3923699.07 | MAE validation test: 2858386.71
Model: -2396352.0117066414
{'model__alpha': 0.9693631061142517}
Results of the best estimator of Pipeline
NMAE in validation: -2396352.01
RMSE train: 3276534.92 | MAE train: 2333075.68
RMSE validation train: 3292824.95 | MAE validation train: 2337932.29
RMSE validation test: 3280253.30 | MAE validation test: 2260087.83
Model: -2658838.747674453
{'model__l1_ratio': 0.43243243243243246, 'model__alpha': 0.26237286577779917}
Results of the best estimator of Pipeline
NMAE in validation: -2658838.75
RMSE train: 3592222.21 | MAE train: 2660152.84
RMSE validation train: 3604425.08 | MAE validation train: 2661450.93
RMSE validation test: 3576311.08 | MAE validation test: 2558163.00
[4.613823652267456, 4.527256488800049, 4.600627183914185]
---------------------------------------------------
LINEAR REGRESSION SELECTED PARAMETERS best model is:

RandomizedSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
                   estimator=Pipeline(steps=[('scaler', RobustScaler()),
                                             ('model',
                                              Ridge(random_state=10))]),
                   n_iter=75, n_jobs=-1,
                   param_distributions={'model__alpha': array([ 0.01      ,  0.01097844,  0.01205261,  0.01323188,  0.01452654,
        0.01594787,  0.01750827,  0.01922135,  0.02110203,  0.02316674,
        0.025...
        0.66730492,  0.73259654,  0.80427655,  0.88297   ,  0.96936311,
        1.06420924,  1.16833549,  1.28264983,  1.40814912,  1.54592774,
        1.69718713,  1.86324631,  2.04555335,  2.245698  ,  2.46542555,
        2.70665207,  2.9714811 ,  3.26222201,  3.5814101 ,  3.93182876,
        4.31653369,  4.73887961,  5.20254944,  5.71158648,  6.27042962,
        6.88395207,  7.55750387,  8.29695852,  9.1087642 , 10.        ])},
                   scoring='neg_mean_absolute_error')

Parameters: {'model__alpha': 1.0642092440647246}

Performance: NMAE (val): -2421796.652193799 | RMSE train: 3256573.9989301027 | MAE train: 2323171.6092511206 | RMSE train in validation: 3267629.5529683903 | MAE train in validation: 2322601.753096195 | RMSE test in validation: 3267567.87998712 | MAE test in validation: 2263068.4012916926
Execution time: 4.527256488800049s

5.3.2.2. Linear Regression - Selected parameters - Attribute Selection¶

In [134]:
np.random.seed(10)
n_splits = 5

# We use Ridge as model as it is the best performing one
pipeline = Pipeline(
    [
        ("scaler", RobustScaler()),
        ("select", SelectKBest(f_regression)),
        ("model", Ridge(fit_intercept=True, random_state=10)),
    ]
)

# Previous model Parameters: {'model__alpha': 0.9693631061142517}

param_grid = {
    "model__alpha": [0.9693631061142517],
    "select__k": list(range(1, X_train.shape[1] + 1)),
}

# We use TimeSeriesSplit to split the data in folds without losing the temporal order
model = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=TimeSeriesSplit(n_splits),
    n_jobs=-1,
)

start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()

total_time = end_time - start_time

# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train)  # We al ready did the 5th fold split at the begginning

# We obtain the different scores of the model
score = train_validation_test(
    model,
    model.best_estimator_,
    model.best_score_,
    X_train,
    y_train,
)

models["LinearReg_select_k"] = model
results["LinearReg_select_k"] = score
times["LinearReg_select_k"] = total_time

# Print results
print_results("LINEAR REGRESSION SELECTED PARAMETERS", best_model, score, total_time)
Results of the best estimator of Pipeline
NMAE in validation: -2389586.49
RMSE train: 3278341.47 | MAE train: 2333541.31
RMSE validation train: 3293274.52 | MAE validation train: 2336194.58
RMSE validation test: 3278610.61 | MAE validation test: 2258218.41
---------------------------------------------------
LINEAR REGRESSION SELECTED PARAMETERS best model is:

RandomizedSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
                   estimator=Pipeline(steps=[('scaler', RobustScaler()),
                                             ('model',
                                              Ridge(random_state=10))]),
                   n_iter=75, n_jobs=-1,
                   param_distributions={'model__alpha': array([ 0.01      ,  0.01097844,  0.01205261,  0.01323188,  0.01452654,
        0.01594787,  0.01750827,  0.01922135,  0.02110203,  0.02316674,
        0.025...
        0.66730492,  0.73259654,  0.80427655,  0.88297   ,  0.96936311,
        1.06420924,  1.16833549,  1.28264983,  1.40814912,  1.54592774,
        1.69718713,  1.86324631,  2.04555335,  2.245698  ,  2.46542555,
        2.70665207,  2.9714811 ,  3.26222201,  3.5814101 ,  3.93182876,
        4.31653369,  4.73887961,  5.20254944,  5.71158648,  6.27042962,
        6.88395207,  7.55750387,  8.29695852,  9.1087642 , 10.        ])},
                   scoring='neg_mean_absolute_error')

Parameters: {'model__alpha': 1.0642092440647246}

Performance: NMAE (val): -2389586.491181177 | RMSE train: 3278341.466529396 | MAE train: 2333541.305110323 | RMSE train in validation: 3293274.5203141714 | MAE train in validation: 2336194.5845998474 | RMSE test in validation: 3278610.608896576 | MAE test in validation: 2258218.4050652594
Execution time: 2.1258764266967773s

To be observed, the selected model, Ridge, does not delete any of the attributes (as expected, as it is one of its flaws), but some of their weights are close to zero, so we can consider that they are not relevant for the model.

On the other hand, the Lasso model and the ElasticNet model, do delete some of the attributes, but the results are worse than the Ridge model, so we will not consider them.

5.4. Dummy Regressor¶

As in the other models, in order to be able to compare the different times and scores, we will divide the dummy regressor into two different models. The first model creates the model without selecting the attributes and the second uses the best parameters of the previous one and selects the attributes through another pipeline.

As strategy we selected "median" as we are dealing with (N)MAE as scoring in the other methods.

In [135]:
from sklearn.dummy import DummyRegressor
In [136]:
np.random.seed(10)
n_splits = 5

pipeline = Pipeline(
    [
        ("scaler", RobustScaler()),
        ("model", DummyRegressor()),
    ]
)

param_grid = {
    'model__strategy': ['median'],
}

model = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=TimeSeriesSplit(n_splits),
    n_jobs=-1,
)

start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()

total_time = end_time - start_time

# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train)  # We al ready did the 5th fold split at the begginning

# We obtain the different scores of the model
score = train_validation_test(
    model,
    model.best_estimator_,
    model.best_score_,
    X_train,
    y_train,
)

models["DummyReg"] = model
results["DummyReg"] = score
times["DummyReg"] = total_time

print_results("RANDOM FOREST PREDEFINED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline
NMAE in validation: -6953359.14
RMSE train: 8058570.05 | MAE train: 6899205.37
RMSE validation train: 8120616.17 | MAE validation train: 6944040.21
RMSE validation test: 7809144.90 | MAE validation test: 6720947.26
---------------------------------------------------
RANDOM FOREST PREDEFINED PARAMETERS best model is:

GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
             estimator=Pipeline(steps=[('scaler', RobustScaler()),
                                       ('model', DummyRegressor())]),
             n_jobs=-1, param_grid={'model__strategy': ['median']},
             scoring='neg_mean_absolute_error')

Parameters: {'model__strategy': 'median'}

Performance: NMAE (val): -6953359.144736841 | RMSE train: 8058570.051086258 | MAE train: 6899205.369863014 | RMSE train in validation: 8120616.171716434 | MAE train in validation: 6944040.205479452 | RMSE test in validation: 7809144.902737563 | MAE test in validation: 6720947.2602739725
Execution time: 0.23421549797058105s

5.4.2. Dummy Regressor - Attribute selection¶

In [137]:
np.random.seed(10)
n_splits = 5

pipeline = Pipeline(
    [
        ("scaler", RobustScaler()),
        ("select", SelectKBest(f_regression)),
        ("model", DummyRegressor(strategy="median")),
    ]
)

# Previous model parameters: {'model__strategy': 'median'}

param_grid = {
    'model__strategy': ['median'],
    "select__k": list(range(1, X_train.shape[1] + 1)),
}

model = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=TimeSeriesSplit(n_splits),
    n_jobs=-1,
)

start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()

total_time = end_time - start_time

# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train)  # We al ready did the 5th fold split at the begginning

# We obtain the different scores of the model
score = train_validation_test(
    model,
    model.best_estimator_,
    model.best_score_,
    X_train,
    y_train,
)

models["DummyReg_k"] = model
results["DummyReg_k"] = score
times["DummyReg_k"] = total_time

print_results("RANDOM FOREST PREDEFINED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline
NMAE in validation: -6953359.14
RMSE train: 8058570.05 | MAE train: 6899205.37
RMSE validation train: 8120616.17 | MAE validation train: 6944040.21
RMSE validation test: 7809144.90 | MAE validation test: 6720947.26
---------------------------------------------------
RANDOM FOREST PREDEFINED PARAMETERS best model is:

GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
             estimator=Pipeline(steps=[('scaler', RobustScaler()),
                                       ('select',
                                        SelectKBest(score_func=<function f_regression at 0x7f817b0e3f40>)),
                                       ('model',
                                        DummyRegressor(strategy='median'))]),
             n_jobs=-1,
             param_grid={'model__strategy': ['median'],
                         'select__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                                       13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
                                       23, 24, 25, 26, 27, 28, 29, 30, ...]},
             scoring='neg_mean_absolute_error')

Parameters: {'model__strategy': 'median', 'select__k': 1}

Performance: NMAE (val): -6953359.144736841 | RMSE train: 8058570.051086258 | MAE train: 6899205.369863014 | RMSE train in validation: 8120616.171716434 | MAE train in validation: 6944040.205479452 | RMSE test in validation: 7809144.902737563 | MAE test in validation: 6720947.2602739725
Execution time: 1.6907029151916504s

As expected, the selection of attributes does not improve the results of the dummy regressor model and it is useless in terms of performance. This is due to the fact that the dummy regressor is a model that does not take into account the attributes, so it does not matter if we select them or not, so it selects just one (k=1).

5.5. Results¶

First, we adjust the times by adding to the time of the selection of the attributes the time of training the real model. This is because the selection of the attributes is done after the training of the model, so it does not contain the time of the training of the model.

This is just an approximation to how much the selection of the attributes will last, as we are using two different models and pipelines, we are not able to calculate it directly.

In [138]:
# We store the partial times for future use
partial_times = times.copy()

# Real time adjustment (we add the time of the attribute selection to the time of the model real training)
print(times)
for key in times.keys():
    # If even, we dont add the time of the attribute selection
    # If odd, we add the time of the attribute selection
    if list(times.keys()).index(key) % 2 != 0:
        times[key] += times[key.replace("_k", "")]
print(times)
{'KNN_pred': 4.260819673538208, 'KNN_pred_k': 4.260819673538208, 'KNN_select': 6.565516233444214, 'KNN_select_k': 4.645997762680054, 'RegTrees_pred': 0.5705435276031494, 'RegTrees_pred_k': 3.333441972732544, 'RegTrees_select': 16.35970973968506, 'RegTrees_select_k': 27.90829086303711, 'LinearReg_pred': 0.2956578731536865, 'LinearReg_pred_k': 2.2786672115325928, 'LinearReg_select': 4.527256488800049, 'LinearReg_select_k': 2.1258764266967773, 'DummyReg': 0.23421549797058105, 'DummyReg_k': 1.6907029151916504}
{'KNN_pred': 4.260819673538208, 'KNN_pred_k': 8.521639347076416, 'KNN_select': 6.565516233444214, 'KNN_select_k': 11.211513996124268, 'RegTrees_pred': 0.5705435276031494, 'RegTrees_pred_k': 3.9039855003356934, 'RegTrees_select': 16.35970973968506, 'RegTrees_select_k': 44.26800060272217, 'LinearReg_pred': 0.2956578731536865, 'LinearReg_pred_k': 2.5743250846862793, 'LinearReg_select': 4.527256488800049, 'LinearReg_select_k': 6.653132915496826, 'DummyReg': 0.23421549797058105, 'DummyReg_k': 1.9249184131622314}
In [139]:
np.random.seed(10)

# ! Obtain best, worst, fastest and slowest model
max_score = max(results.values(), key=lambda x: abs(x[0]))  # We use the scoring (NMAE) as explained above to select the best model
min_score = min(results.values(), key=lambda x: abs(x[0]))
# Obtain the key name of the best and worst model
max_time = max(times.values(), key=lambda x: x)
min_time = min(times.values(), key=lambda x: x)

best_model = list(results.keys())[list(results.values()).index(min_score)]
worst_model = list(results.keys())[list(results.values()).index(max_score)]
fastest_model = list(times.keys())[list(times.values()).index(min_time)]
slowest_model = list(times.keys())[list(times.values()).index(max_time)]

print(f"Best model: {best_model} with score (-NMAE) {abs(min_score[0])} and time {list(times.values())[list(results.values()).index(min_score)]}s")
print(f"Worst model: {worst_model} with score (-NMAE) {abs(max_score[0])} and time {list(times.values())[list(results.values()).index(max_score)]}s")
print(f"Fastest model: {fastest_model} with score (-NMAE) {abs(results[fastest_model][0])} and time {min_time}s")
print(f"Slowest model: {slowest_model} with score(-NMAE) {abs(results[slowest_model][0])} and time {max_time}s")


# ! Average (test MAE) score of the models
avg_score = 0
avg_time = 0

for key, value in results.items():
    avg_score += results[key][0]
    avg_time += times[key]

print(f"\nAverage models score: {abs(avg_score/len(results))}")
print(f"Average models time: {avg_time/len(times)}\n")


# ! Differences
print("The score difference between the best and worst model is: ", abs(max_score[0] - min_score)[0])  # Scoring evaluation -NMAE
print("The score difference between the best and fastest model is: ", abs(min_score[0] - abs(results[fastest_model][0])))  # Scoring evaluation -NMAE
print("The time difference between the best and fastest model model is: ", abs(list(times.values())[list(results.values()).index(min_score)] - min_time))
print("The time difference between the fastest and slowest model is: ", abs(max_time - min_time))
Best model: LinearReg_select_k with score (-NMAE) 2389586.491181177 and time 6.653132915496826s
Worst model: DummyReg with score (-NMAE) 6953359.144736841 and time 0.23421549797058105s
Fastest model: DummyReg with score (-NMAE) 6953359.144736841 and time 0.23421549797058105s
Slowest model: RegTrees_select_k with score(-NMAE) 2727416.151315789 and time 44.26800060272217s

Average models score: 3373778.2092190557
Average models time: 7.990802492414202

The score difference between the best and worst model is:  4563772.653555664
The score difference between the best and fastest model is:  9342945.635918017
The time difference between the best and fastest model model is:  6.418917417526245
The time difference between the fastest and slowest model is:  44.03378510475159
In [140]:
# Print the results up to now
plt.rcParams['figure.figsize'] = [10, 3.5]

# ! Plot the scores (NMAE in evaluation)
print("MODEL SCORES (NMAE in evaluation)")
iter = 0
for key, value in results.items():
    plt.bar(key, abs(value[0]))
    print(f"{iter}. {key}: {abs(value[0])}")
    iter += 1
plt.title("Score")
plt.xlabel("Model")
plt.ylabel("NMAE scoring in validation")
plt.tight_layout()

plt.xticks(rotation=45, ha='right', size=7)

# Exporting image as png to ../data/img folder
plt.savefig("../data/img/basic_methods_score.png")
plt.show()

# ! Plot the time (just the even ones == the ones that are not selectors of attributes)
print("MODEL TIMES (s)")
iter = 0
for key, value in times.items():
    if iter % 2 == 0:
        plt.bar(key, value)
        print(f"{iter}. {key}: {value}")
    iter += 1
plt.title("Time")
plt.xlabel("Model")
plt.ylabel("Time (s)")
plt.tight_layout()

plt.xticks(rotation=45, ha='right', size=7)

# Exporting image as png to ../data/img folder
plt.savefig("../data/img/basic_methods_time.png")
plt.show()

# ! Plot the time (just the odd ones == the selectors of attributes)
print("MODEL ATTRIBUTE SELECTION TIMES (s)")
iter = 0
for key, value in times.items():
    if iter % 2 != 0:
        plt.bar(key, value)
        print(f"{iter}. {key}: {value}")
    iter += 1
plt.title("Time to select attributes")
plt.xlabel("Model")
plt.ylabel("Time (s)")
plt.tight_layout()

plt.xticks(rotation=45, ha='right', size=7)

# Exporting image as png to ../data/img folder - easier to visualize the annotations, better resolution
plt.savefig("../data/img/basic_methods_time_atb.png")
plt.show()
MODEL SCORES (NMAE in evaluation)
0. KNN_pred: 3239984.25
1. KNN_pred_k: 2690780.4078947366
2. KNN_select: 2880131.5631625694
3. KNN_select_k: 2603870.865432223
4. RegTrees_pred: 3467149.4407894737
5. RegTrees_pred_k: 3328832.171052632
6. RegTrees_select: 2743220.575657895
7. RegTrees_select_k: 2727416.151315789
8. LinearReg_pred: 2437056.0592061607
9. LinearReg_pred_k: 2421796.652193799
10. LinearReg_select: 2396352.0117066414
11. LinearReg_select_k: 2389586.491181177
12. DummyReg: 6953359.144736841
13. DummyReg_k: 6953359.144736841
MODEL TIMES (s)
0. KNN_pred: 4.260819673538208
2. KNN_select: 6.565516233444214
4. RegTrees_pred: 0.5705435276031494
6. RegTrees_select: 16.35970973968506
8. LinearReg_pred: 0.2956578731536865
10. LinearReg_select: 4.527256488800049
12. DummyReg: 0.23421549797058105
MODEL ATTRIBUTE SELECTION TIMES (s)
1. KNN_pred_k: 8.521639347076416
3. KNN_select_k: 11.211513996124268
5. RegTrees_pred_k: 3.9039855003356934
7. RegTrees_select_k: 44.26800060272217
9. LinearReg_pred_k: 2.5743250846862793
11. LinearReg_select_k: 6.653132915496826
13. DummyReg_k: 1.9249184131622314

5.6. Conclusions:¶

After computing all of the models, we can draw some conclusions:

  • Best model: the best model in terms of scoring (-NMAE) is the LinearReg_select_k (linear regression with selected parameters and selected attributes) model, with a score of 2389586.491181177
  • Fastest model: the fastest model is the dummy regressor, as it does not need to train the model, it just returns the median of the target variable.
  • Fastest Basic model: the fastest basic model is the Linear Regression with predefined parameters, as it does not need to select attributes. Surprisingly, it is pretty accurate as well (linear regression models work pretty well out of the box without needing parameter selection), having a score 2437056.0592061607 near to the one obtained by the Linear Regression with selected parameters and attribute selection.

¿Are the obtained results better than the ones from the naive approach?

Yes, they are, all of the models had a way better score than the one obtained by the naive approach, which had an -NMAE of 6953359.144736841 (with and without attribute selection). This can be clearly seen in the first graph showing the different models and their scores.
There, it can be observed that all of the models we tested outperformed the dummy model by a significant margin. The dummy model produced an NMAE error of -6953359.144736841, while our worst-performing model (RegTrees_pred) produced an error of -3467149.4407894737.

¿Are models with selection of parameters better than the ones that use the predefined ones?

We found that in general, selecting hyperparameters led to better results across all of the models. However, this improvement came at the cost of increased time and computing resources. Therefore, when deciding which model to use, it is important to consider the balance between improved performance and increased training time.

¿Are models with selection of attributes better than the ones that use all of them?

By observing the performace (both time and score), and the graphs, we can observe that the models with attribute selection generally perform better than those that use all attributes as they help reduce noise and irrelevant features, resulting in a more focused set of attributes for modeling, improving accuracy and generalization. However, as it can be observed in the third graph, attribute selection increases computation time as it adds a stage of preprocessing to the model. Despite the longer training times, the benefits of improved performance may outweigh the costs. As stated with selection of parameters question, it is needed to consider specific project requirements and resources when deciding on the approach to use.

Model selection

Regarding the individual models, we observed that the LinearReg_select_k model performed the best in terms of NMAE, while the RegTrees_pred model performed the worst, as it overfits perfectly the training data.

Based on these findings, we would recommend using the LinearReg_select_k model (-NMAE: 2389586.491181177; time: 4.264322519302368) if the client prioritizes accuracy over computing time. However, if computing time is a priority, we would recommend using the LinearReg_pred model (-NMAE: 2437056.0592061607; time: 0.23032045364379883), which only sacrifices about 47470 points of accuracy points (1.95%) while reducing computing time by more than 94.2%.

On the other hand, if we wanted a balance in between those two models, there are LinearReg_pred_k (-NMAE: 2421796.652193799; time: 2.2443368434906006) and LinearReg_select (-NMAE: 2396352.0117066414; time: 2.5291051864624023), which are a good balance between accuracy and computing time.

Ultimately, the decision of which model to use depends on the client's budget and objectives. For the final prediction, we recommend using the LinearReg_select_k (or LinearReg_select) model as it still provides a good balance between accuracy and computing time. Moreover, taking into account the dataset and the problem nature, it is plausible that the model training will be done at much yearly, so the time saved by using the LinearReg_pred instead of the LinearReg_pred_k model is not significant compared to the score gained.


6. Reducing Dimensionality¶

It is possible to reduce the problem's dimensionality, as evidenced by the findings in the EDA section, where numerous attributes were identified to be highly correlated. By removing some of these attributes, we can effectively reduce the dimensionality of the problem. Therefore, it is recommended to utilize Principal Component Analysis (PCA) as a technique to reduce the dimensionality of the problem.

As highlighted in the EDA section, there are several attributes that exhibit strong interrelationships to the point of being redundant (with a correlation higher than 98%).

There are two different approaches to reduce the dimensionality of the problem:

  • The first one was removing by hand the attributes seen as redundant in the EDA section. This approach was not used as it would be a tedious and error-prone process, and it would not be scalable to other datasets.

  • The second approach, which is the one used in the industry, is to use a feature selection algorithm preprocessing in the pipeline of the models to automatically identify and remove redundant attributes.

This second approach of using pipelines with attribute selection was the one employed in our project. It was implemented using the "SelectKBest(f_regression)" feature selector, which considers only the linear relationship between the attributes and the output variable. As consecuence, using this feature selector leaves room to more possible optimisations and selections of correlated relationships of non-linear nature or interrelationships between the attributes, as seen in the EDA section. Therefore, there is still a space to improve the results by using a more advanced feature selection algorithm, such as the "Recursive Feature Elimination" algorithm (RFE).

In [141]:
# Create a dictionary with all the dataset variables
def get_variable_freq():
    columns = disp_df.columns.tolist()
    variables = {col: 0 for col in columns}

    # Getting the selected attributes for each model
    for model in models.keys():
        # We only want to check the models that select attributes (take into account that dummy regressor selection is included(dswrf_s3_1))
        if list(models.keys()).index(model) % 2 != 0:
            # We get the selected attributes
            selected_atb = models[model].best_estimator_.named_steps["select"].get_support()
            # We get the names of the selected attributes
            selected_atb_names = X_train.columns[selected_atb]
            
            print(f"{model} selected {len(selected_atb_names)}")
            
            # Make a frequency table of the selected attributes
            selected_atb_names = pd.DataFrame(selected_atb_names)
            selected_atb_names.columns = ["Attribute"]
            selected_atb_names = selected_atb_names.groupby("Attribute").size().reset_index(name="Frequency")
            selected_atb_names = selected_atb_names.sort_values(by="Frequency", ascending=False)
            selected_atb_names = selected_atb_names.reset_index(drop=True)
            
            # Append the results the dictionary
            for atb in selected_atb_names["Attribute"]:
                variables[atb] += 1

    print(f"Attributes frequency: {variables}")

    # plot all the attributes and their frequency
    plt.rcParams['figure.figsize'] = [10, 3.5]

    for key, value in variables.items():
        plt.bar(key, value)

    plt.title("Frequency of the selected attributes")
    plt.xlabel("Attribute")
    plt.ylabel("Frequency")
    plt.tight_layout()

    plt.xticks(rotation=45, ha='right', size=6.5)
    plt.show()

get_variable_freq()
KNN_pred_k selected 6
KNN_select_k selected 6
RegTrees_pred_k selected 9
RegTrees_select_k selected 4
LinearReg_pred_k selected 72
LinearReg_select_k selected 72
DummyReg_k selected 1
Attributes frequency: {'apcp_sf1_1': 2, 'apcp_sf2_1': 2, 'apcp_sf3_1': 2, 'apcp_sf4_1': 2, 'apcp_sf5_1': 2, 'dlwrf_s1_1': 2, 'dlwrf_s2_1': 2, 'dlwrf_s3_1': 2, 'dlwrf_s4_1': 2, 'dlwrf_s5_1': 2, 'dswrf_s1_1': 2, 'dswrf_s2_1': 5, 'dswrf_s3_1': 7, 'dswrf_s4_1': 6, 'dswrf_s5_1': 6, 'pres_ms1_1': 0, 'pres_ms2_1': 0, 'pres_ms3_1': 0, 'pres_ms4_1': 2, 'pres_ms5_1': 2, 'pwat_ea1_1': 2, 'pwat_ea2_1': 2, 'pwat_ea3_1': 2, 'pwat_ea4_1': 2, 'pwat_ea5_1': 2, 'spfh_2m1_1': 2, 'spfh_2m2_1': 2, 'spfh_2m3_1': 2, 'spfh_2m4_1': 2, 'spfh_2m5_1': 2, 'tcdc_ea1_1': 2, 'tcdc_ea2_1': 2, 'tcdc_ea3_1': 2, 'tcdc_ea4_1': 2, 'tcdc_ea5_1': 2, 'tcolc_e1_1': 2, 'tcolc_e2_1': 2, 'tcolc_e3_1': 2, 'tcolc_e4_1': 2, 'tcolc_e5_1': 2, 'tmax_2m1_1': 2, 'tmax_2m2_1': 2, 'tmax_2m3_1': 2, 'tmax_2m4_1': 2, 'tmax_2m5_1': 2, 'tmin_2m1_1': 2, 'tmin_2m2_1': 2, 'tmin_2m3_1': 2, 'tmin_2m4_1': 2, 'tmin_2m5_1': 2, 'tmp_2m_1_1': 2, 'tmp_2m_2_1': 2, 'tmp_2m_3_1': 2, 'tmp_2m_4_1': 2, 'tmp_2m_5_1': 2, 'tmp_sfc1_1': 2, 'tmp_sfc2_1': 2, 'tmp_sfc3_1': 2, 'tmp_sfc4_1': 2, 'tmp_sfc5_1': 2, 'ulwrf_s1_1': 2, 'ulwrf_s2_1': 2, 'ulwrf_s3_1': 2, 'ulwrf_s4_1': 3, 'ulwrf_s5_1': 3, 'ulwrf_t1_1': 2, 'ulwrf_t2_1': 2, 'ulwrf_t3_1': 2, 'ulwrf_t4_1': 2, 'ulwrf_t5_1': 2, 'uswrf_s1_1': 2, 'uswrf_s2_1': 6, 'uswrf_s3_1': 5, 'uswrf_s4_1': 2, 'uswrf_s5_1': 3, 'salida': 0}

Upon analyzing the graph, we gain valuable insights into the attributes that are frequently selected by the feature selector like dswrf_s2_1, dswrf_s3_1, dswrf_s4_1, dswrf_s5_1, uswrf_s2_1, and uswrf_s3_1. It is evident that the attributes chosen are highly correlated with the target variable, aligning with our expectations. This reaffirms the efficacy of the feature selector in identifying relevant attributes correlated to the target variable.
Take into account that some of the mentioned attributes are also highly correlated with each other (as seen during EDA), but this is not a problem that our feature selector is able to identify.

Conversely, we can also infer that attributes that are scarcely selected by the feature selector, such as pres_ms1_1, pres_ms2_1, and pres_ms3_1, are not significant in the context of the problem. This is indicative that these attributes lack correlation with the target variable, and their inclusion in the model may introduce noise or irrelevant information. Hence, the feature selector's ability to filter out such attributes further strengthens its effectiveness in feature selection and highlights the importance of using it for improved model performance.

In [142]:
plt.rcParams['figure.figsize'] = [10, 3.5]

# Select the even times (the ones that are not selectors of attributes)
times_no_atb = {k: v for k, v in times.items() if list(times.keys()).index(k) % 2 == 0}
# Select the odd times (the ones that are selectors of attributes)
times_atb = {k: v for k, v in times.items() if list(times.keys()).index(k) % 2 != 0}

# Sum both dictionaries to get the total time of each model
for key in times_atb.keys():
    times_atb[key] += times_no_atb[key.replace("_k", "")]

times_no_atb_arr = list(times_no_atb.values())
times_atb_arr = list(times_atb.values())

model_indices = np.arange(len(list(times_no_atb.keys())))

width = 0.35
fig, ax = plt.subplots()
rects1 = ax.bar(model_indices - width/2, times_no_atb_arr, width, label='No attribute selection')
rects2 = ax.bar(model_indices + width/2, times_atb_arr, width, label='Attribute selection')

ax.set_xlabel('Model')
ax.set_ylabel('Times')
ax.set_title('')
ax.set_xticks(model_indices)
ax.set_xticklabels(list(times_no_atb.keys()))
ax.legend()

plt.xticks(size=5.9)
plt.show()


iter = 0
for key, value in results.items():
    plt.bar(key, abs(value[0]))
    iter += 1
plt.title("Score")
plt.xlabel("Model")
plt.ylabel("NMAE scoring in validation")
plt.tight_layout()

plt.xticks(rotation=30, ha='right', size=7)
plt.show()

As mentioned in the "Conclusions" section (Section 5.6), the performance of the models with attribute selection is generally better compared to using all attributes. This is because attribute selection helps reduce noise and irrelevant features, resulting in a more focused set of attributes for modeling, which can lead to improved accuracy and generalization, as evident from the performance metrics and graphs. However, it should be noted that attribute selection does add an additional preprocessing stage to the model, which can increase computation time. Despite the longer training times, the potential benefits of improved performance may outweigh the costs.

The performance of the models with and without attribute selection is clearly depicted in the two above graphs:

  • The first graph illustrates that the models with attribute selection require more computing time compared to those using all attributes. This is expected due to the additional preprocessing stage.

  • The second graph demonstrates how the models with attribute selection outperform those using all attributes, as they effectively reduce noise and irrelevant features, resulting in improved performance.


7. Advanced methods¶

In order to be consistent, although we have already seen that usign the selection of attributes makes the model better, we will continue to use the two-step pipeline method we have been using in the basic models. This way, we can also verify that the results are better than the ones obtained with the basic methods (for both with and without attribute selection).

7.1. Support Vector Machines (SVMs)¶

Support Vector Machine (SVM) is a supervised learning algorithm used for classification and regression analysis. SVM works by finding the hyperplane that best separates the data into different classes. The hyperplane is chosen such that it maximizes the margin between the closest data points from each class, known as support vectors. SVM can also use kernel functions to transform the input data into a higher dimensional space, allowing the separation of non-linearly separable data.

Note: in this dataset, the target variable conatains very large values, and the default value of C is 1.0 by default, which is too small for this dataset. This will make the SVM act as if it was a dummy regressor (as seen before), where it simply predicts the mean of the target variable for all data points, leading to poor model performance. This can be readily observed in section 8 of the notebook, where we compare the values and results of all the models, including the computation time and score. Notably, we find that the results of the Support Vector Machine (SVM) model with the default value of C are identical to those of the dummy regressor.

To overcome this issue, it is important to select an appropriate value for the C parameter that matches the characteristics of the dataset. By increasing the value of C to a more suitable value, the SVM becomes more flexible and capable of fitting the data better. This allows the SVM to capture the underlying patterns and relationships in the dataset more accurately, resulting in improved prediction performance.

In [143]:
from sklearn.svm import SVR

7.1.1. SVMs - Predefined parameters¶

7.1.1.1. SVMs - Predefined parameters - No attribute selection¶

In [144]:
np.random.seed(10)
n_splits = 5

pipeline = Pipeline(
    [
        ('scaler', RobustScaler()),
        ("model", SVR())
    ]
)

param_grid = {
    "model__kernel": ["rbf"],
    "model__C": [1.0],
    "model__gamma": ["scale"],
    "model__epsilon": [0.1],
}

model = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=TimeSeriesSplit(n_splits),
)


start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()

total_time = end_time - start_time

# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train)  # We al ready did the 5th fold split at the begginning

# We obtain the different scores of the model
score = train_validation_test(
    model,
    model.best_estimator_,
    model.best_score_,
    X_train,
    y_train,
)

models["SVM_pred"] = model
results["SVM_pred"] = score
times["SVM_pred"] = total_time

print_results("SVM PREDEFINED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline
NMAE in validation: -6953343.12
RMSE train: 8058525.28 | MAE train: 6899170.65
RMSE validation train: 8120576.93 | MAE validation train: 6944009.69
RMSE validation test: 7809107.04 | MAE validation test: 6720917.54
---------------------------------------------------
SVM PREDEFINED PARAMETERS best model is:

GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
             estimator=Pipeline(steps=[('scaler', RobustScaler()),
                                       ('model', SVR())]),
             param_grid={'model__C': [1.0], 'model__epsilon': [0.1],
                         'model__gamma': ['scale'], 'model__kernel': ['rbf']},
             scoring='neg_mean_absolute_error')

Parameters: {'model__C': 1.0, 'model__epsilon': 0.1, 'model__gamma': 'scale', 'model__kernel': 'rbf'}

Performance: NMAE (val): -6953343.117286754 | RMSE train: 8058525.276321875 | MAE train: 6899170.647284467 | RMSE train in validation: 8120576.9337218935 | MAE train in validation: 6944009.686073507 | RMSE test in validation: 7809107.037449846 | MAE test in validation: 6720917.536373118
Execution time: 2.1233932971954346s

As it was stated before, it can be clearly seen that with the default 1.0 value of C, the SVM acts as a dummy regressor.

7.1.1.2. SVMs - Predefined parameters - Attribute selection¶

In [145]:
np.random.seed(10)
n_splits = 5

pipeline = Pipeline(
    [
        ("scaler", RobustScaler()),
        ("select", SelectKBest(f_regression)),
        ("model", SVR()),
    ]
)

param_grid = {
    "model__kernel": ["rbf"],
    "model__C": [1.0],
    "model__gamma": ["scale"],
    "model__epsilon": [0.1],
    "select__k": list(range(1, X_train.shape[1] + 1)),
}

model = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=TimeSeriesSplit(n_splits),
    n_jobs=-1,
)


start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()

total_time = end_time - start_time

# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train)  # We al ready did the 5th fold split at the begginning

# We obtain the different scores of the model
score = train_validation_test(
    model,
    model.best_estimator_,
    model.best_score_,
    X_train,
    y_train,
)

models["SVM_pred_k"] = model
results["SVM_pred_k"] = score
times["SVM_pred_k"] = total_time

print_results("SVM PREDEFINED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline
NMAE in validation: -6952999.55
RMSE train: 8057851.47 | MAE train: 6898491.59
RMSE validation train: 8120039.58 | MAE validation train: 6943474.80
RMSE validation test: 7808606.98 | MAE validation test: 6720375.02
---------------------------------------------------
SVM PREDEFINED PARAMETERS best model is:

GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
             estimator=Pipeline(steps=[('scaler', RobustScaler()),
                                       ('select',
                                        SelectKBest(score_func=<function f_regression at 0x7f817b0e3f40>)),
                                       ('model', SVR())]),
             n_jobs=-1,
             param_grid={'model__C': [1.0], 'model__epsilon': [0.1],
                         'model__gamma': ['scale'], 'model__kernel': ['rbf'],
                         'select__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                                       13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
                                       23, 24, 25, 26, 27, 28, 29, 30, ...]},
             scoring='neg_mean_absolute_error')

Parameters: {'model__C': 1.0, 'model__epsilon': 0.1, 'model__gamma': 'scale', 'model__kernel': 'rbf', 'select__k': 1}

Performance: NMAE (val): -6952999.553407727 | RMSE train: 8057851.465218631 | MAE train: 6898491.591050504 | RMSE train in validation: 8120039.578355538 | MAE train in validation: 6943474.801851249 | RMSE test in validation: 7808606.977541712 | MAE test in validation: 6720375.017111049
Execution time: 29.39649200439453s

7.1.2. SVMs - Selected parameters¶

Building upon the previous definition, we can reduce the most important parameters to be adjusted to the following for SVM:

  • Kernel: This parameter defines the type of kernel used to transform the input data into a higher-dimensional space in order to perform classification. The most commonly used kernels are linear, polynomial, radial basis function (RBF), and sigmoid.
  • C: This parameter determines the trade-off between maximizing the margin and minimizing the classification error. A smaller value of C creates a larger margin but may misclassify some data points, while a larger value of C may lead to overfitting.
  • Gamma: This parameter defines the influence of each training example on the decision boundary. A smaller value of gamma makes the decision boundary smoother, while a larger value of gamma makes it more complex and can lead to overfitting.
In [146]:
rmse = []
mae = []
rmse2 = []
mae2 = []

a_c = [1.0, 100, 10000, 100000, 1000000, 10000000, 100000000, 1000000000, 10000000000, 100000000000, 1000000000000]
a_kernel = ["linear", "rbf", "sigmoid", "poly"]

for i in a_c:
    model = SVR(C=i)
    model.fit(X_train_5th_fold_train , y_train_5th_fold_train )
    y_pred = model.predict(X_test_5th_fold_train)
    rmse.append(np.sqrt(mean_squared_error(y_test_5th_fold_train , y_pred)))
    mae.append(mean_absolute_error(y_test_5th_fold_train, y_pred))

for i in a_kernel:
    model = SVR(kernel=i)
    model.fit(X_train_5th_fold_train , y_train_5th_fold_train )
    y_pred = model.predict(X_test_5th_fold_train)
    rmse2.append(np.sqrt(mean_squared_error(y_test_5th_fold_train , y_pred)))
    mae2.append(mean_absolute_error(y_test_5th_fold_train, y_pred))

# Crear dos subplots, uno para cada gráfico
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(8, 12))

# Graficar RMSE vs. C en el primer subplot
ax1.plot(a_c, rmse, label="RMSE")
ax1.set_xlabel("C")
ax1.set_ylabel("RMSE")
ax1.set_title("Gráfica de RMSE")

# Graficar MAE vs. C en el segundo subplot
ax2.plot(a_c, mae, label="MAE")
ax2.set_xlabel("C")
ax2.set_ylabel("MAE")
ax2.set_title("Gráfica de MAE")

# Graficar RMSE vs. kernel en el tercer subplot
ax3.plot(a_kernel, rmse2, label="RMSE")
ax3.set_xlabel("kernel")
ax3.set_ylabel("RMSE")
ax3.set_title("Gráfica de RMSE")

# Graficar MAE vs. metric en el cuarto subplot
ax4.plot(a_kernel, mae2, label="MAE")
ax4.set_xlabel("kernel")
ax4.set_ylabel("MAE")
ax4.set_title("Gráfica de MAE")

plt.tight_layout()
plt.rcParams['figure.figsize'] = [10, 3]
plt.show()

7.1.2.1. SVMs - Selected parameters - No attribute selection¶

Note: we needed to add more parameters to C in order to have a budget of 75 so its computing time is comparable to the other models.

In [147]:
np.random.seed(10)
budget = 75 
n_splits = 5

pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        # We scale the data to avoid overfitting - Recommended for SVMs
        ("model", SVR())
        # Support Vector Regression (SVR for regression, SVC for classification)
    ]
)

# We need to reduce the C parameter number to reduce the computational time -> tends to infinity
param_grid = {
    "model__kernel": ["linear", "rbf", "sigmoid", "poly"], # poly is too slow and not near good as linear
    "model__C": [500000, 5000000, 7000000, 750000, 7750000, 800000, 8500000, 1000000, 5000000, 10000000],
    "model__gamma": ["scale", "auto"],
}

# We use TimeSeriesSplit to split the data in folds without losing the temporal order
model = RandomizedSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=TimeSeriesSplit(n_splits),
    n_iter=budget,
    n_jobs=-1,
)

start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()

total_time = end_time - start_time

# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train)  # We al ready did the 5th fold split at the begginning

# We obtain the different scores of the model
score = train_validation_test(
    model,
    model.best_estimator_,
    model.best_score_,
    X_train,
    y_train,
)

models["SVM_select"] = model
results["SVM_select"] = score
times["SVM_select"] = total_time

print_results("SVM SELECTED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline
NMAE in validation: -2331297.20
RMSE train: 3390722.81 | MAE train: 2254336.08
RMSE validation train: 3402804.08 | MAE validation train: 2244918.83
RMSE validation test: 3486393.92 | MAE validation test: 2328968.77
---------------------------------------------------
SVM SELECTED PARAMETERS best model is:

RandomizedSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
                   estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                             ('model', SVR())]),
                   n_iter=75, n_jobs=-1,
                   param_distributions={'model__C': [500000, 5000000, 7000000,
                                                     750000, 7750000, 800000,
                                                     8500000, 1000000, 5000000,
                                                     10000000],
                                        'model__gamma': ['scale', 'auto'],
                                        'model__kernel': ['linear', 'rbf',
                                                          'sigmoid', 'poly']},
                   scoring='neg_mean_absolute_error')

Parameters: {'model__kernel': 'linear', 'model__gamma': 'auto', 'model__C': 1000000}

Performance: NMAE (val): -2331297.199428374 | RMSE train: 3390722.8061495544 | MAE train: 2254336.0822791597 | RMSE train in validation: 3402804.0781806447 | MAE train in validation: 2244918.82901025 | RMSE test in validation: 3486393.9201029483 | MAE test in validation: 2328968.7736492744
Execution time: 165.52876663208008s

7.1.2.2. SVMs - Selected parameters - Attribute selection¶

In [149]:
np.random.seed(10)
n_splits = 5

# We use Ridge as model as it is the best performing one
pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("select", SelectKBest(f_regression)),
        ("model", SVR())
    ]
)

# Previous model Parameters: {'model__kernel': 'linear', 'model__gamma': 'auto', 'model__C': 1000000}

param_grid = {
    "model__kernel": ["linear"],
    "model__C": [1000000],
    "model__gamma": ["auto"],
    "select__k": list(range(1, X_train.shape[1] + 1)),
}

# We use TimeSeriesSplit to split the data in folds without losing the temporal order
model = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=TimeSeriesSplit(n_splits),
    n_jobs=-1,
)

start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()

total_time = end_time - start_time

# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train)  # We al ready did the 5th fold split at the begginning

# We obtain the different scores of the model
score = train_validation_test(
    model,
    model.best_estimator_,
    model.best_score_,
    X_train,
    y_train,
)

models["SVM_select_k"] = model
results["SVM_select_k"] = score
times["SVM_select_k"] = total_time

print_results("SVM SELECTED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline
NMAE in validation: -2331773.95
RMSE train: 3384381.68 | MAE train: 2251590.34
RMSE validation train: 3452046.46 | MAE validation train: 2272265.76
RMSE validation test: 3570029.36 | MAE validation test: 2374500.73
---------------------------------------------------
SVM SELECTED PARAMETERS best model is:

GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
             estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('select',
                                        SelectKBest(score_func=<function f_regression at 0x7f817b0e3f40>)),
                                       ('model', SVR())]),
             n_jobs=-1,
             param_grid={'model__C': [1000000], 'model__gamma': ['auto'],
                         'model__kernel': ['linear'],
                         'select__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                                       13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
                                       23, 24, 25, 26, 27, 28, 29, 30, ...]},
             scoring='neg_mean_absolute_error')

Parameters: {'model__C': 1000000, 'model__gamma': 'auto', 'model__kernel': 'linear', 'select__k': 61}

Performance: NMAE (val): -2331773.9543751357 | RMSE train: 3384381.6752997166 | MAE train: 2251590.3404441564 | RMSE train in validation: 3452046.4594078064 | MAE train in validation: 2272265.7630818374 | RMSE test in validation: 3570029.3628540644 | MAE test in validation: 2374500.7294935877
Execution time: 38.443318367004395s

7.2. Random Forests¶

Random forest is an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class or mean prediction of the individual trees. Random forests improve on the decision tree model by reducing overfitting and increasing accuracy. This is achieved by generating multiple decision trees and then aggregating their predictions through a voting system.

In [151]:
from sklearn.ensemble import RandomForestRegressor

7.2.1. Random Forests - Predefined parameters¶

7.2.1.1. Random Forests - Predefined parameters - No attribute selection¶

In [150]:
np.random.seed(10)
n_splits = 5

pipeline = Pipeline([("model", RandomForestRegressor(random_state=10))])

param_grid = {
    "model__n_estimators": [100],
    "model__criterion": ["squared_error"],
    "model__max_depth": [None],
    "model__min_samples_split": [2],
    "model__max_features": [None],
}

model = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=TimeSeriesSplit(n_splits),
    n_jobs=-1,
)

start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()

total_time = end_time - start_time

# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train)  # We al ready did the 5th fold split at the begginning

# We obtain the different scores of the model
score = train_validation_test(
    model,
    model.best_estimator_,
    model.best_score_,
    X_train,
    y_train,
)

models["RandForest_pred"] = model
results["RandForest_pred"] = score
times["RandForest_pred"] = total_time

print_results("RANDOM FOREST PREDEFINED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline
NMAE in validation: -2453026.62
RMSE train: 1230647.08 | MAE train: 859275.53
RMSE validation train: 1247101.73 | MAE validation train: 871871.43
RMSE validation test: 3316103.20 | MAE validation test: 2268131.29
---------------------------------------------------
RANDOM FOREST PREDEFINED PARAMETERS best model is:

GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
             estimator=Pipeline(steps=[('model',
                                        RandomForestRegressor(random_state=10))]),
             n_jobs=-1,
             param_grid={'model__criterion': ['squared_error'],
                         'model__max_depth': [None],
                         'model__max_features': [None],
                         'model__min_samples_split': [2],
                         'model__n_estimators': [100]},
             scoring='neg_mean_absolute_error')

Parameters: {'model__criterion': 'squared_error', 'model__max_depth': None, 'model__max_features': None, 'model__min_samples_split': 2, 'model__n_estimators': 100}

Performance: NMAE (val): -2453026.6184210526 | RMSE train: 1230647.077689153 | MAE train: 859275.5293150685 | RMSE train in validation: 1247101.733618154 | MAE train in validation: 871871.4328767123 | RMSE test in validation: 3316103.1974173784 | MAE test in validation: 2268131.293150685
Execution time: 22.279118299484253s

7.2.1.2. Random Forests - Predefined parameters - Attribute selection¶

In [152]:
np.random.seed(10)
n_splits = 5

pipeline = Pipeline(
    [("select", SelectKBest(f_regression)), ("model", RandomForestRegressor(random_state=10))]
)

param_grid = {
    "model__n_estimators": [100],
    "model__criterion": ["squared_error"],
    "model__max_depth": [None],
    "model__min_samples_split": [2],
    "model__max_features": [None],
    "select__k": list(range(1, X_train.shape[1] + 1)),
}

model = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=TimeSeriesSplit(n_splits),
    n_jobs=-1,
)

start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()

total_time = end_time - start_time

# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train)  # We al ready did the 5th fold split at the begginning

# We obtain the different scores of the model
score = train_validation_test(
    model,
    model.best_estimator_,
    model.best_score_,
    X_train,
    y_train,
)

models["RandForest_pred_k"] = model
results["RandForest_pred_k"] = score
times["RandForest_pred_k"] = total_time

print_results("RANDOM FOREST PREDEFINED PARAMETERS", model, score, total_time)
Results of the best estimator of Pipeline
NMAE in validation: -2453026.62
RMSE train: 1230647.08 | MAE train: 859275.53
RMSE validation train: 1246492.73 | MAE validation train: 872079.40
RMSE validation test: 3310640.67 | MAE validation test: 2264352.50
---------------------------------------------------
RANDOM FOREST PREDEFINED PARAMETERS best model is:

GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
             estimator=Pipeline(steps=[('select',
                                        SelectKBest(score_func=<function f_regression at 0x7f817b0e3f40>)),
                                       ('model',
                                        RandomForestRegressor(random_state=10))]),
             n_jobs=-1,
             param_grid={'model__criterion': ['squared_error'],
                         'model__max_depth': [None],
                         'model__max_features': [None],
                         'model__min_samples_split': [2],
                         'model__n_estimators': [100],
                         'select__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                                       13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
                                       23, 24, 25, 26, 27, 28, 29, 30, ...]},
             scoring='neg_mean_absolute_error')

Parameters: {'model__criterion': 'squared_error', 'model__max_depth': None, 'model__max_features': None, 'model__min_samples_split': 2, 'model__n_estimators': 100, 'select__k': 72}

Performance: NMAE (val): -2453026.6184210526 | RMSE train: 1230647.077689153 | MAE train: 859275.5293150685 | RMSE train in validation: 1246492.7307536777 | MAE train in validation: 872079.3976027397 | RMSE test in validation: 3310640.668170457 | MAE test in validation: 2264352.497260274
Execution time: 144.7716188430786s

7.2.2. Random Forests - Selected parameters¶

Building upon the previous definition, we can reduce the most important parameters to be adjusted to the following:

  • n_estimators: controls the number of trees in the forest.
  • max_depth: controls the maximum depth of each tree in the forest.
  • min_samples_split: controls the minimum number of instances a leaf must have in order to be able to subdivide. This parameter can prevent the tree from overfitting.
  • min_samples_leaf: controls the minimum number of instances required to be at a leaf node. Like min_samples_split, this parameter can also help prevent overfitting.
In [ ]:
rmse = []
mae = []
rmse2 = []
mae2 = []
rmse3 = []
mae3 =[]

a_n_stimators = [10, 30, 50, 70, 100, 130, 170, 200]
a_max_depth = range(5, 36, 5)
a_min_samples_split = range(5, 200, 15)

for i in a_n_stimators:
    model = RandomForestRegressor(random_state=10, n_estimators=i)
    model.fit(X_train_5th_fold_train , y_train_5th_fold_train )
    y_pred = model.predict(X_test_5th_fold_train)
    rmse3.append(np.sqrt(mean_squared_error(y_test_5th_fold_train , y_pred)))
    mae3.append(mean_absolute_error(y_test_5th_fold_train, y_pred))
    
for i in a_max_depth:
    model = RandomForestRegressor(random_state=10, max_depth=i)
    model.fit(X_train_5th_fold_train , y_train_5th_fold_train )
    y_pred = model.predict(X_test_5th_fold_train)
    rmse.append(np.sqrt(mean_squared_error(y_test_5th_fold_train , y_pred)))
    mae.append(mean_absolute_error(y_test_5th_fold_train, y_pred))
    
for i in a_min_samples_split:
    model = RandomForestRegressor(random_state=10, min_samples_split=i)
    model.fit(X_train_5th_fold_train , y_train_5th_fold_train )
    y_pred = model.predict(X_test_5th_fold_train)
    rmse2.append(np.sqrt(mean_squared_error(y_test_5th_fold_train , y_pred)))
    mae2.append(mean_absolute_error(y_test_5th_fold_train, y_pred))

# Crear dos subplots, uno para cada gráfico
fig, (ax1, ax2, ax3, ax4, ax5, ax6) = plt.subplots(6, 1, figsize=(8, 12))

# Graficar RMSE vs. n_neighbors en el primer subplot
ax1.plot(list(a_max_depth), rmse, label="RMSE")
ax1.set_xlabel("max_depth")
ax1.set_ylabel("RMSE")
ax1.set_title("Gráfica de RMSE")

# Graficar MAE vs. n_neighbors en el segundo subplot
ax2.plot(list(a_max_depth), mae, label="MAE")
ax2.set_xlabel("max_depth")
ax2.set_ylabel("MAE")
ax2.set_title("Gráfica de MAE")

# Graficar RMSE vs. metric en el tercer subplot
ax3.plot(list(a_min_samples_split), rmse2, label="RMSE")
ax3.set_xlabel("min_samples_split")
ax3.set_ylabel("RMSE")
ax3.set_title("Gráfica de RMSE")

# Graficar MAE vs. metric en el cuarto subplot
ax4.plot(list(a_min_samples_split), mae2, label="MAE")
ax4.set_xlabel("min_samples_split")
ax4.set_ylabel("MAE")
ax4.set_title("Gráfica de MAE")

# Graficar RMSE vs. metric en el tercer subplot
ax5.plot(a_n_stimators, rmse3, label="RMSE")
ax5.set_xlabel("n_estimators")
ax5.set_ylabel("RMSE")
# ax5.set_title("Gráfica de RMSE")

# Graficar MAE vs. metric en el cuarto subplot
ax6.plot(a_n_stimators, mae3, label="MAE")
ax6.set_xlabel("n_estimators")
ax6.set_ylabel("MAE")
ax6.set_title("Gráfica de MAE")

plt.tight_layout()
plt.rcParams['figure.figsize'] = [10, 3]
plt.show()

7.2.2.1. Random Forests - Selected parameters - No attribute selection¶

In [169]:
np.random.seed(10)
budget = 75
n_splits = 5

pipeline = Pipeline(
    [
        ("model", RandomForestRegressor(random_state=10))
    ]
)

param_grid = {
    "model__n_estimators": [100, 300, 350, 400, 450], #  500, 600, 700, 900, 10000 -> too slow for the minimal improvements they offer in the scoring (not even perceptible) - 450 still makes a decent improvement
    "model__max_depth": list(range(5, 36, 5)),
    "model__min_samples_split": [2, 3, 4, 5],
    "model__max_features": ["sqrt"], # log2 does not offer as good results
}

model = RandomizedSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=TimeSeriesSplit(n_splits),
    n_iter=budget,
    n_jobs=-1,
)

start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()

total_time = end_time - start_time

# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train)  # We al ready did the 5th fold split at the begginning

# We obtain the different scores of the model
score = train_validation_test(
    model,
    model.best_estimator_,
    model.best_score_,
    X_train,
    y_train,
)

models["RandForest_select"] = model
results["RandForest_select"] = score
times["RandForest_select"] = total_time

print_results("Random Forest", model, score, total_time)
Results of the best estimator of Pipeline
NMAE in validation: -2323073.36
RMSE train: 1248296.12 | MAE train: 871530.08
RMSE validation train: 1215594.37 | MAE validation train: 850946.23
RMSE validation test: 3230903.24 | MAE validation test: 2197047.20
---------------------------------------------------
Random Forest best model is:

RandomizedSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
                   estimator=Pipeline(steps=[('model',
                                              RandomForestRegressor(random_state=10))]),
                   n_iter=75, n_jobs=-1,
                   param_distributions={'model__max_depth': [5, 10, 15, 20, 25,
                                                             30, 35],
                                        'model__max_features': ['sqrt'],
                                        'model__min_samples_split': [2, 3, 4,
                                                                     5],
                                        'model__n_estimators': [100, 300, 350,
                                                                400, 450]},
                   scoring='neg_mean_absolute_error')

Parameters: {'model__n_estimators': 450, 'model__min_samples_split': 2, 'model__max_features': 'sqrt', 'model__max_depth': 25}

Performance: NMAE (val): -2323073.358721178 | RMSE train: 1248296.1226726803 | MAE train: 871530.0753454532 | RMSE train in validation: 1215594.3653978498 | MAE train in validation: 850946.2297440937 | RMSE test in validation: 3230903.2390529006 | MAE test in validation: 2197047.195952723
Execution time: 124.77998352050781s

7.2.2.2. Random Forests - Selected parameters - Attribute selection¶

In [170]:
np.random.seed(10)
n_splits = 5

pipeline = Pipeline(
    [("select", SelectKBest(f_regression)), ("model", RandomForestRegressor(random_state=10))]
)

param_grid = {
    "model__n_estimators": [450],
    "model__max_depth": [25],
    "model__min_samples_split": [2],
    "model__max_features": ["sqrt"],
    "select__k": list(range(1, X_train.shape[1] + 1)),
}

model = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=TimeSeriesSplit(n_splits),
    n_jobs=-1,
)

start_time = time.time()
model.fit(X=X_train, y=y_train)
end_time = time.time()

total_time = end_time - start_time

# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train)  # We al ready did the 5th fold split at the begginning

# We obtain the different scores of the model
score = train_validation_test(
    model,
    model.best_estimator_,
    model.best_score_,
    X_train,
    y_train,
)

models["RandForest_select_k"] = model
results["RandForest_select_k"] = score
times["RandForest_select_k"] = total_time

print_results("Random Forest", model, score, total_time)
Results of the best estimator of Pipeline
NMAE in validation: -2322506.37
RMSE train: 1187813.56 | MAE train: 831397.07
RMSE validation train: 1216037.12 | MAE validation train: 853005.44
RMSE validation test: 3225078.82 | MAE validation test: 2191218.75
---------------------------------------------------
Random Forest best model is:

GridSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
             estimator=Pipeline(steps=[('select',
                                        SelectKBest(score_func=<function f_regression at 0x7f817b0e3f40>)),
                                       ('model',
                                        RandomForestRegressor(random_state=10))]),
             n_jobs=-1,
             param_grid={'model__max_depth': [25],
                         'model__max_features': ['sqrt'],
                         'model__min_samples_split': [2],
                         'model__n_estimators': [450],
                         'select__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                                       13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
                                       23, 24, 25, 26, 27, 28, 29, 30, ...]},
             scoring='neg_mean_absolute_error')

Parameters: {'model__max_depth': 25, 'model__max_features': 'sqrt', 'model__min_samples_split': 2, 'model__n_estimators': 450, 'select__k': 69}

Performance: NMAE (val): -2322506.367627253 | RMSE train: 1187813.5582928217 | MAE train: 831397.0666560882 | RMSE train in validation: 1216037.1201242164 | MAE train in validation: 853005.4435113268 | RMSE test in validation: 3225078.8210638203 | MAE test in validation: 2191218.7493181694
Execution time: 147.9776096343994s

7.3. Attribute importance¶

In [178]:
# Plotting the most used attributes
get_variable_freq()
KNN_pred_k selected 6
KNN_select_k selected 6
RegTrees_pred_k selected 9
RegTrees_select_k selected 4
LinearReg_pred_k selected 72
LinearReg_select_k selected 72
DummyReg_k selected 1
SVM_pred_k selected 1
SVM_select_k selected 61
RandForest_pred_k selected 72
RandForest_select_k selected 69
Attributes frequency: {'apcp_sf1_1': 3, 'apcp_sf2_1': 4, 'apcp_sf3_1': 5, 'apcp_sf4_1': 4, 'apcp_sf5_1': 5, 'dlwrf_s1_1': 4, 'dlwrf_s2_1': 4, 'dlwrf_s3_1': 5, 'dlwrf_s4_1': 5, 'dlwrf_s5_1': 5, 'dswrf_s1_1': 4, 'dswrf_s2_1': 8, 'dswrf_s3_1': 11, 'dswrf_s4_1': 9, 'dswrf_s5_1': 9, 'pres_ms1_1': 0, 'pres_ms2_1': 0, 'pres_ms3_1': 0, 'pres_ms4_1': 3, 'pres_ms5_1': 4, 'pwat_ea1_1': 3, 'pwat_ea2_1': 4, 'pwat_ea3_1': 4, 'pwat_ea4_1': 5, 'pwat_ea5_1': 5, 'spfh_2m1_1': 5, 'spfh_2m2_1': 5, 'spfh_2m3_1': 5, 'spfh_2m4_1': 5, 'spfh_2m5_1': 5, 'tcdc_ea1_1': 5, 'tcdc_ea2_1': 5, 'tcdc_ea3_1': 5, 'tcdc_ea4_1': 5, 'tcdc_ea5_1': 5, 'tcolc_e1_1': 5, 'tcolc_e2_1': 5, 'tcolc_e3_1': 5, 'tcolc_e4_1': 5, 'tcolc_e5_1': 5, 'tmax_2m1_1': 5, 'tmax_2m2_1': 5, 'tmax_2m3_1': 5, 'tmax_2m4_1': 5, 'tmax_2m5_1': 5, 'tmin_2m1_1': 5, 'tmin_2m2_1': 5, 'tmin_2m3_1': 5, 'tmin_2m4_1': 5, 'tmin_2m5_1': 5, 'tmp_2m_1_1': 5, 'tmp_2m_2_1': 5, 'tmp_2m_3_1': 5, 'tmp_2m_4_1': 5, 'tmp_2m_5_1': 5, 'tmp_sfc1_1': 5, 'tmp_sfc2_1': 5, 'tmp_sfc3_1': 5, 'tmp_sfc4_1': 5, 'tmp_sfc5_1': 5, 'ulwrf_s1_1': 5, 'ulwrf_s2_1': 5, 'ulwrf_s3_1': 5, 'ulwrf_s4_1': 6, 'ulwrf_s5_1': 6, 'ulwrf_t1_1': 5, 'ulwrf_t2_1': 5, 'ulwrf_t3_1': 5, 'ulwrf_t4_1': 5, 'ulwrf_t5_1': 5, 'uswrf_s1_1': 5, 'uswrf_s2_1': 9, 'uswrf_s3_1': 8, 'uswrf_s4_1': 5, 'uswrf_s5_1': 6, 'salida': 0}
In [179]:
# Getting the importance of each attribute in the models
print("Random forest feature importance")

feature_importance_arr = []

for model in models:
    # Only for Random Forest
    if model.startswith("RandForest"):
        
        # Get the feature importances and attribute names
        feature_importances = models[model].best_estimator_.named_steps["model"].feature_importances_
        #attribute_names = models[model].best_estimator_.named_steps["preprocessor"].transformers_[0][2]
        # Print the feature importance + the name of the attribute
        for feature_importance in feature_importances:
            feature_importance_arr.append(feature_importance)

print(f"{feature_importance_arr}")
Random forest feature importance
[0.006823467236950922, 0.005289307300387328, 0.006267694296187506, 0.004022317151944208, 0.005404432357240497, 0.0022519491070036647, 0.0018361790164572482, 0.002247769952480248, 0.0014650044914462713, 0.0013006480224745894, 0.0004960851418488201, 0.0055382107495284315, 0.315742659652574, 0.3517793255489918, 0.04184928580091645, 0.002458978401449125, 0.002053964997419152, 0.0015936968889132862, 0.0020610986322658993, 0.0025472233010468913, 0.002879771282194894, 0.0024998467946694917, 0.0021275033383356613, 0.0022459040220456344, 0.003496953824292926, 0.0026076753823704698, 0.0017930838183311767, 0.0015719768382412133, 0.0027201445198449213, 0.003708652654260024, 0.003897506027980119, 0.003167200122163599, 0.0028718893936009737, 0.0018100864005531929, 0.0025377048235108606, 0.007044176167206148, 0.0062632988315099196, 0.004404341213717501, 0.003908374369991919, 0.00395205992060554, 0.0019721308557179005, 0.0011637037207955414, 0.000800379853040628, 0.000838520500870624, 0.0013666777131945717, 0.0009608105869797138, 0.0009190921579659422, 0.0008173443796027224, 0.001021960754386707, 0.00151280327981057, 0.0007639976942684739, 0.0009346498206002092, 0.0009534210634611878, 0.0012829642911412935, 0.002465317576758651, 0.001165180991507426, 0.0013971482408855862, 0.00199209205831166, 0.0024706120760282723, 0.00360350807458225, 0.0013436229867836167, 0.0021948655823592015, 0.0011227743340925592, 0.0018105761751751005, 0.0015890353220823447, 0.0033750976600800415, 0.00364699503427307, 0.003577394363633643, 0.003268232876439697, 0.0037954585711020995, 0.0003503972143710564, 0.008752891824028953, 0.04809102633776446, 0.05056290288042291, 0.009578965354528917, 0.0068403869301967395, 0.005373279303395462, 0.006372854124595899, 0.004111600062359494, 0.005382524543626419, 0.002376707941881877, 0.0018655300328057575, 0.0023187503945616595, 0.0014623332529318355, 0.0014143686169222132, 0.0004998132798549249, 0.00558797526039632, 0.3159371643993869, 0.351988610155103, 0.04212065117411924, 0.0034656893816911337, 0.0032270325690724857, 0.002935420343051062, 0.0023937024548202125, 0.0023119591202634343, 0.0022941716323672937, 0.003526358900124105, 0.0027365092286796883, 0.001765582954544091, 0.0017221784359597186, 0.002820601209027081, 0.003650330653143147, 0.003992074051631715, 0.0031344067030880975, 0.003074613747433749, 0.0018998704090882916, 0.0025954938809593956, 0.006990140912730289, 0.00644897734514765, 0.004473559756813118, 0.0037406250236200053, 0.00402454314681774, 0.002006163875451341, 0.0012158224737252418, 0.0007043584662778803, 0.0009642697822988204, 0.0012018319153018597, 0.0011034909135244964, 0.0009054555635728764, 0.0008879402231901818, 0.001085559920023129, 0.001955913969888609, 0.0007624518602052361, 0.0008781706836049682, 0.001029512529331001, 0.001349096954243529, 0.002501339848028483, 0.0011391828221795318, 0.001399662440200732, 0.0019101702579749115, 0.0024695480574327133, 0.003393099668721604, 0.0013818307482691425, 0.0021432896705809437, 0.0012325333968976205, 0.0016715611251833181, 0.0018575620441252682, 0.0037406654663174246, 0.0035201938928982276, 0.003608244753719502, 0.0034049511389126594, 0.0037936346289837663, 0.00032082449417820773, 0.008803707887331204, 0.0482002525909784, 0.05068142859196787, 0.009899886012268103, 0.003724035703835566, 0.0062359361179643216, 0.009240913210498341, 0.0058127312331301495, 0.005372061245997567, 0.0029465278723924157, 0.0033012693536005255, 0.003251694102810587, 0.0023560055460886955, 0.0024355632035187942, 0.0004236419769430372, 0.061599560073731824, 0.09121017194569941, 0.09190147878427775, 0.0899781182151589, 0.0023106723244598697, 0.002380129580585116, 0.002309647463192926, 0.0022045185874568505, 0.0023375751909906714, 0.0037045066333714696, 0.004394961453288986, 0.0036257550464911734, 0.0032218892535575885, 0.0031306358653928083, 0.002381892753595732, 0.002360840897174175, 0.0023876405322844378, 0.0029395366380763174, 0.003070244226964902, 0.007223606447359112, 0.011819211834030799, 0.008633007896986646, 0.0060502797575006685, 0.00399533509753541, 0.006818744483782783, 0.013415431657438318, 0.009817634506956403, 0.006095255590229829, 0.00460012059346887, 0.0021554413534535907, 0.0027284219254231665, 0.005250853161822686, 0.007624268986852327, 0.005098788125669833, 0.0018391598461120668, 0.001986793219779255, 0.0021815200639112757, 0.004969594575969665, 0.012872757247774012, 0.001951277869362719, 0.003982290976427233, 0.00603765245448148, 0.009867923533034227, 0.01298464716438702, 0.0019704910959919956, 0.00414342000609132, 0.012112937225630735, 0.02291973606720845, 0.010725025342562434, 0.0020494403497396535, 0.0032765366573998707, 0.003928727252962354, 0.013274934271127805, 0.02041052055492802, 0.002736095709877521, 0.006332732491709332, 0.008790365434515857, 0.011744338408147098, 0.017296899204761195, 0.0006843381698444326, 0.07671220695457114, 0.06854948839276186, 0.06464427316059666, 0.06514731984729216, 0.007733787879633859, 0.009203910580481157, 0.005883398098927908, 0.005586300117648668, 0.0033886070005346137, 0.003250393730715769, 0.003252941408309692, 0.002544006630893072, 0.0026915716395631848, 0.0005035528915993646, 0.05849207968349354, 0.088025909872651, 0.10228702484621725, 0.09221107319459551, 0.0027695957918516074, 0.004172588624545819, 0.004010671551562766, 0.0035812002224717434, 0.003512597770595825, 0.0026632339724947077, 0.0026167481464412046, 0.002547337600033131, 0.002973510335774411, 0.0035224074896548067, 0.00663833109510814, 0.011783420132970795, 0.008558065898171529, 0.005307442994300936, 0.003791095630677494, 0.008802753095846605, 0.014173866773218455, 0.009471583900944443, 0.006477952777471706, 0.0047612177007301335, 0.002430657290167516, 0.00329565654911136, 0.003905075888666834, 0.006302551285052551, 0.0062497040501772495, 0.0019487786598519168, 0.0020139206995887305, 0.00190188114483014, 0.008527372394746047, 0.012121269286707871, 0.002066044377839217, 0.0040608826634516235, 0.0040999243541863725, 0.004696990742636009, 0.009840383869773702, 0.0020716293517295164, 0.0037579750458387614, 0.011898670876887427, 0.021356473011829087, 0.011658375964503014, 0.002269671455828889, 0.0023767923179314177, 0.004373268570312354, 0.014836026245567995, 0.0182371460817263, 0.003090051935164686, 0.004645871298289147, 0.012897685394097504, 0.017111736203689196, 0.01860208498866022, 0.0004009093072549272, 0.07327670150901809, 0.08316695799290197, 0.059525515045730476, 0.0637951850661211]
In [180]:
# Get the 5 most important attributes
# Sort feature importances in descending order
if model.startswith("RandForest"):
        print(model)
        # Get the feature importances and attribute names
        feature_importances = models[model].best_estimator_.named_steps["model"].feature_importances_
        
importances_descending = sorted(zip(feature_importances, disp_df.columns), reverse=True)
# Print top n attributes and their importances
n_top_attributes = 5
for importance, attribute_name in importances_descending[:n_top_attributes]:
    print(f"{attribute_name}: {importance}")
RandForest_select_k
dswrf_s3_1: 0.10228702484621725
dswrf_s4_1: 0.09221107319459551
dswrf_s2_1: 0.088025909872651
ulwrf_t2_1: 0.08316695799290197
ulwrf_t1_1: 0.07327670150901809

First of all, it must be understanded that the easiest way of knowing the most relevant attributes is by using the trees, since this model uses the attributes in ranked relevance to split data in each level. As we can see in the list provided, the most relevant attributes are the following:

  • dswrf_s4_1: 0.46959314797977664
  • dswrf_s3_1: 0.19949434015634968
  • uswrf_s4_1: 0.04859888546408619
  • dswrf_s5_1: 0.04723173888571538
  • uswrf_s3_1: 0.040436760096997794

The importance and meaning of this attributes can be found in the EDA section.

On the other hand, when checking the frequency of our selected attributes from our scoring function, we need to take into account that some of them are highly correlated with each other, but as it was stated before, our scoring function cant take into account this type of correlations, so it is important to keep in mind that the most relevant attributes are the ones that are highly correlated with the target variable.

That is why the most relevant attributes may be the ones that the Random Forest model has selected, since it is a tree-based model, and it is able to take into account the correlations between the attributes. But at the end all depends on quality of the data, the size of the dataset, the specific problem being solved, and the quality of the model.

7.4. Conclusions¶

In [185]:
# Select the even times (the ones that are not selectors of attributes)
times_no_atb = {k: v for k, v in times.items() if list(times.keys()).index(k) % 2 == 0}
# Select the odd times (the ones that are selectors of attributes)
times_atb = {k: v for k, v in times.items() if list(times.keys()).index(k) % 2 != 0}

# Sum both dictionaries to get the total time of each model
for key in times_atb.keys():
    times_atb[key] += times_no_atb[key.replace("_k", "")]

times_no_atb_arr = list(times_no_atb.values())
times_atb_arr = list(times_atb.values())

# Solo los que empiezan por SVM o RandForest
model_indices = np.arange(len(list(times_no_atb.keys())))

width = 0.35
fig, ax = plt.subplots()
rects1 = ax.bar(model_indices - width/2, times_no_atb_arr, width, label='No attribute selection')
rects2 = ax.bar(model_indices + width/2, times_atb_arr, width, label='Attribute selection')

ax.set_xlabel('Model')
ax.set_ylabel('Times')
ax.set_title('')
ax.set_xticks(model_indices)
ax.set_xticklabels(list(times_no_atb.keys()))
ax.legend()

plt.xticks(size=5.9)
plt.show()

iter = 0
for key, value in results.items():
    plt.bar(key, abs(value[0]))
    iter += 1
plt.title("Score")
plt.xlabel("Model")
plt.ylabel("NMAE scoring in validation")
plt.tight_layout()

plt.xticks(rotation=30, ha='right', size=7)
plt.show()

iter = 0
for key, value in results.items():
    plt.bar(key, abs(value[6]))
    iter += 1
plt.title("Score")
plt.xlabel("Model")
plt.ylabel("MAE in test validation")
plt.tight_layout()

plt.xticks(rotation=30, ha='right', size=7)
plt.show()

The process of selecting the best parameters and attributes for advanced models is a crucial step in machine learning. In this study, the selection of attributes was performed using the SelectKBest method, which helps to eliminate attributes that have a low correlation with the output, resulting in better performance for the models. The selection of parameters was also a critical factor in improving the model's accuracy. In particular, the models with parameter selection achieved better scores than the models without parameter selection.

However, it's important to note that this improvement in performance comes with a trade-off: the increase in score implies a notable increase in time and computational cost. Therefore, when selecting the best models, it's essential to consider not only their performance but also their computational complexity.

After testing several models, the three best-performing models were identified as Random Forests, SVMs, and Linear Regression, with or without attribute and parameter selection. These models showed the highest scores in the experiments, and one of them will be selected as the final model in section 8.1.1, where we will compare the results of both SVM and Random Forests closely in order to make a wise decision.

Overall, the results of this study demonstrate the importance of selecting the right parameters and attributes when building advanced machine learning models. By doing so, we can achieve better accuracy and performance, leading to more effective and efficient machine learning applications.


8. Best model¶

We will re-visit all the models and select the best one, which we have stated to be the one with the lowest MAE and the lowest RMSE.

In [186]:
# ! Print the models best parameters
i = 0
for key, value in models.items():
    print(f"\n\n{i}. Sected model: {key}\n")
    print(f"Parameters: {value.best_params_}")
    print(
        f"\nPerformance:\n",
        f"NMAE (val): {results[key][0]}\n",
        f"RMSE train: {results[key][1]} | ",
        f"MAE train: {results[key][2]}\n",
        f"RMSE train in validation: {results[key][3]} | ",
        f"MAE train in validation: {results[key][4]}\n",
        f"RMSE test in validation: {results[key][5]} | ",
        f"MAE test in validation: {results[key][6]}",
        sep="",
    )
    print(f"Time: {times[key]} s")
    i+=1

plt.rcParams['figure.figsize'] = [10, 3.5]

# ! Plot (NMAE)
print("MODEL SCORES (NMAE in evaluation)")
iter = 0
for key, value in results.items():
    plt.bar(key, abs(value[0]))
    print(f"{iter}. {key}: {abs(value[0])}")
    iter += 1
plt.title("Score")
plt.xlabel("Model")
plt.ylabel("NMAE scoring in validation")
plt.tight_layout()

plt.xticks(rotation=45, ha='right', size=7)

# Exporting image as png to ../data/img folder
plt.savefig("../data/img/advanced_methods_score.png")
plt.show()

# ! Plot (MAE train in validation)
iter = 0
for key, value in results.items():
    plt.bar(key, abs(value[4]))
    print(f"{iter}. {key}: {abs(value[4])}")
    iter += 1
plt.title("Score")
plt.xlabel("Model")
plt.ylabel("MAE train in validation")
plt.tight_layout()

plt.xticks(rotation=45, ha='right', size=7)

# Exporting image as png to ../data/img folder
plt.savefig("../data/img/basic_methods_score.png")
plt.show()

# ! Plot (MAE test in validation)
iter = 0
for key, value in results.items():
    plt.bar(key, abs(value[6]))
    print(f"{iter}. {key}: {abs(value[6])}")
    iter += 1
plt.title("Score")
plt.xlabel("Model")
plt.ylabel("MAE test in validation")
plt.tight_layout()

plt.xticks(rotation=45, ha='right', size=7)

# Exporting image as png to ../data/img folder
plt.savefig("../data/img/basic_methods_score.png")
plt.show()

# ! Plot the time
iter = 0
for key, value in times.items():
    plt.bar(key, value)
    print(f"{iter}. {key}: {value}")
    iter += 1
plt.title("Time")
plt.xlabel("Model")
plt.ylabel("Time (s)")
plt.tight_layout()

plt.xticks(rotation=45, ha='right', size=7)

# ! Plot the accumulated approximated real times
print("Accumulated approximated real times")
# Select the even times (the ones that are not selectors of attributes)
times_no_atb = {k: v for k, v in times.items() if list(times.keys()).index(k) % 2 == 0}
# Select the odd times (the ones that are selectors of attributes)
times_atb = {k: v for k, v in times.items() if list(times.keys()).index(k) % 2 != 0}

# Sum both dictionaries to get the total time of each model
for key in times_atb.keys():
    times_atb[key] += times_no_atb[key.replace("_k", "")]

times_no_atb_arr = list(times_no_atb.values())
times_atb_arr = list(times_atb.values())

model_indices = np.arange(len(list(times_no_atb.keys())))

width = 0.35
fig, ax = plt.subplots()
rects1 = ax.bar(model_indices - width/2, times_no_atb_arr, width, label='No attribute selection')
rects2 = ax.bar(model_indices + width/2, times_atb_arr, width, label='Attribute selection')

ax.set_xlabel('Model')
ax.set_ylabel('Times')
ax.set_title('')
ax.set_xticks(model_indices)
ax.set_xticklabels(list(times_no_atb.keys()))
ax.legend()

plt.xticks(size=5.9)
plt.show()


# Exporting image as png to ../data/img folder - easier to visualize the annotations, better resolution
plt.savefig("../data/img/advanced_methods_time.png")
plt.show()

0. Sected model: KNN_pred

Parameters: {'model__algorithm': 'auto', 'model__metric': 'minkowski', 'model__n_neighbors': 5, 'model__weights': 'uniform'}

Performance:
NMAE (val): -3239984.25
RMSE train: 3517654.379918169 | MAE train: 2493007.2164383563
RMSE train in validation: 3557480.484807456 | MAE train in validation: 2518007.157534247
RMSE test in validation: 4152140.058048495 | MAE test in validation: 2892257.01369863
Time: 4.260819673538208 s


1. Sected model: KNN_pred_k

Parameters: {'model__algorithm': 'auto', 'model__metric': 'minkowski', 'model__n_neighbors': 5, 'model__weights': 'uniform', 'select__k': 6}

Performance:
NMAE (val): -2690780.4078947366
RMSE train: 3108869.5243311627 | MAE train: 2162755.2657534247
RMSE train in validation: 3116226.5417679944 | MAE train in validation: 2171515.705479452
RMSE test in validation: 3775814.085873258 | MAE test in validation: 2560118.5479452056
Time: 8.521639347076416 s


2. Sected model: KNN_select

Parameters: {'model__weights': 'distance', 'model__n_neighbors': 17, 'model__metric': 'manhattan', 'model__algorithm': 'kd_tree'}

Performance:
NMAE (val): -2880131.5631625694
RMSE train: 0.0 | MAE train: 0.0
RMSE train in validation: 0.0 | MAE train in validation: 0.0
RMSE test in validation: 3732609.9812009404 | MAE test in validation: 2587777.1287017944
Time: 6.565516233444214 s


3. Sected model: KNN_select_k

Parameters: {'model__algorithm': 'kd_tree', 'model__metric': 'manhattan', 'model__n_neighbors': 9, 'model__weights': 'distance', 'select__k': 6}

Performance:
NMAE (val): -2603870.865432223
RMSE train: 1355.336484531192 | MAE train: 31.726027397260275
RMSE train in validation: 25827.044900407618 | MAE train in validation: 675.9246575342465
RMSE test in validation: 3681057.75211333 | MAE test in validation: 2483096.382277287
Time: 11.211513996124268 s


4. Sected model: RegTrees_pred

Parameters: {'model__criterion': 'squared_error', 'model__max_depth': None, 'model__max_features': None, 'model__min_samples_split': 2}

Performance:
NMAE (val): -3467149.4407894737
RMSE train: 0.0 | MAE train: 0.0
RMSE train in validation: 0.0 | MAE train in validation: 0.0
RMSE test in validation: 4961507.791413844 | MAE test in validation: 3406755.205479452
Time: 0.5705435276031494 s


5. Sected model: RegTrees_pred_k

Parameters: {'model__criterion': 'squared_error', 'model__max_depth': None, 'model__max_features': None, 'model__min_samples_split': 2, 'select__k': 9}

Performance:
NMAE (val): -3328832.171052632
RMSE train: 0.0 | MAE train: 0.0
RMSE train in validation: 0.0 | MAE train in validation: 0.0
RMSE test in validation: 5002502.819275869 | MAE test in validation: 3460965.616438356
Time: 3.9039855003356934 s


6. Sected model: RegTrees_select

Parameters: {'model__min_samples_split': 106, 'model__max_features': None, 'model__max_depth': 30, 'model__criterion': 'absolute_error'}

Performance:
NMAE (val): -2743220.575657895
RMSE train: 3259190.446254432 | MAE train: 2080612.602739726
RMSE train in validation: 3286556.310045412 | MAE train in validation: 2092567.191780822
RMSE test in validation: 3914582.5939823505 | MAE test in validation: 2655352.602739726
Time: 16.35970973968506 s


7. Sected model: RegTrees_select_k

Parameters: {'model__criterion': 'absolute_error', 'model__max_depth': 30, 'model__max_features': None, 'model__min_samples_split': 106, 'select__k': 4}

Performance:
NMAE (val): -2727416.151315789
RMSE train: 3452866.617242818 | MAE train: 2199234.328767123
RMSE train in validation: 3561457.960699349 | MAE train in validation: 2280089.2808219176
RMSE test in validation: 4044668.035092536 | MAE test in validation: 2710957.602739726
Time: 44.26800060272217 s


8. Sected model: LinearReg_pred

Parameters: {'model__fit_intercept': True}

Performance:
NMAE (val): -2437056.0592061607
RMSE train: 3254352.603690468 | MAE train: 2321647.0597032406
RMSE train in validation: 3265297.879240584 | MAE train in validation: 2322380.6106294743
RMSE test in validation: 3268115.4760430153 | MAE test in validation: 2265683.802964292
Time: 0.2956578731536865 s


9. Sected model: LinearReg_pred_k

Parameters: {'model__fit_intercept': True, 'select__k': 72}

Performance:
NMAE (val): -2421796.652193799
RMSE train: 3256573.9989301027 | MAE train: 2323171.6092511206
RMSE train in validation: 3267629.5529683903 | MAE train in validation: 2322601.753096195
RMSE test in validation: 3267567.87998712 | MAE test in validation: 2263068.4012916926
Time: 2.5743250846862793 s


10. Sected model: LinearReg_select

Parameters: {'model__alpha': 1.0642092440647246}

Performance:
NMAE (val): -2396352.0117066414
RMSE train: 3276534.918917554 | MAE train: 2333075.6766354376
RMSE train in validation: 3292824.947383585 | MAE train in validation: 2337932.2861476443
RMSE test in validation: 3280253.2990303193 | MAE test in validation: 2260087.8287112545
Time: 4.527256488800049 s


11. Sected model: LinearReg_select_k

Parameters: {'model__alpha': 0.9693631061142517, 'select__k': 72}

Performance:
NMAE (val): -2389586.491181177
RMSE train: 3278341.466529396 | MAE train: 2333541.305110323
RMSE train in validation: 3293274.5203141714 | MAE train in validation: 2336194.5845998474
RMSE test in validation: 3278610.608896576 | MAE test in validation: 2258218.4050652594
Time: 6.653132915496826 s


12. Sected model: DummyReg

Parameters: {'model__strategy': 'median'}

Performance:
NMAE (val): -6953359.144736841
RMSE train: 8058570.051086258 | MAE train: 6899205.369863014
RMSE train in validation: 8120616.171716434 | MAE train in validation: 6944040.205479452
RMSE test in validation: 7809144.902737563 | MAE test in validation: 6720947.2602739725
Time: 0.23421549797058105 s


13. Sected model: DummyReg_k

Parameters: {'model__strategy': 'median', 'select__k': 1}

Performance:
NMAE (val): -6953359.144736841
RMSE train: 8058570.051086258 | MAE train: 6899205.369863014
RMSE train in validation: 8120616.171716434 | MAE train in validation: 6944040.205479452
RMSE test in validation: 7809144.902737563 | MAE test in validation: 6720947.2602739725
Time: 1.9249184131622314 s


14. Sected model: SVM_pred

Parameters: {'model__C': 1.0, 'model__epsilon': 0.1, 'model__gamma': 'scale', 'model__kernel': 'rbf'}

Performance:
NMAE (val): -6953343.117286754
RMSE train: 8058525.276321875 | MAE train: 6899170.647284467
RMSE train in validation: 8120576.9337218935 | MAE train in validation: 6944009.686073507
RMSE test in validation: 7809107.037449846 | MAE test in validation: 6720917.536373118
Time: 2.1233932971954346 s


15. Sected model: SVM_pred_k

Parameters: {'model__C': 1.0, 'model__epsilon': 0.1, 'model__gamma': 'scale', 'model__kernel': 'rbf', 'select__k': 1}

Performance:
NMAE (val): -6952999.553407727
RMSE train: 8057851.465218631 | MAE train: 6898491.591050504
RMSE train in validation: 8120039.578355538 | MAE train in validation: 6943474.801851249
RMSE test in validation: 7808606.977541712 | MAE test in validation: 6720375.017111049
Time: 29.39649200439453 s


16. Sected model: SVM_select

Parameters: {'model__kernel': 'linear', 'model__gamma': 'auto', 'model__C': 1000000}

Performance:
NMAE (val): -2331297.199428374
RMSE train: 3390722.8061495544 | MAE train: 2254336.0822791597
RMSE train in validation: 3402804.0781806447 | MAE train in validation: 2244918.82901025
RMSE test in validation: 3486393.9201029483 | MAE test in validation: 2328968.7736492744
Time: 165.52876663208008 s


17. Sected model: SVM_select_k

Parameters: {'model__C': 1000000, 'model__gamma': 'auto', 'model__kernel': 'linear', 'select__k': 61}

Performance:
NMAE (val): -2331773.9543751357
RMSE train: 3384381.6752997166 | MAE train: 2251590.3404441564
RMSE train in validation: 3452046.4594078064 | MAE train in validation: 2272265.7630818374
RMSE test in validation: 3570029.3628540644 | MAE test in validation: 2374500.7294935877
Time: 38.443318367004395 s


18. Sected model: RandForest_pred

Parameters: {'model__criterion': 'squared_error', 'model__max_depth': None, 'model__max_features': None, 'model__min_samples_split': 2, 'model__n_estimators': 100}

Performance:
NMAE (val): -2453026.6184210526
RMSE train: 1230647.077689153 | MAE train: 859275.5293150685
RMSE train in validation: 1247101.733618154 | MAE train in validation: 871871.4328767123
RMSE test in validation: 3316103.1974173784 | MAE test in validation: 2268131.293150685
Time: 22.279118299484253 s


19. Sected model: RandForest_pred_k

Parameters: {'model__criterion': 'squared_error', 'model__max_depth': None, 'model__max_features': None, 'model__min_samples_split': 2, 'model__n_estimators': 100, 'select__k': 72}

Performance:
NMAE (val): -2453026.6184210526
RMSE train: 1230647.077689153 | MAE train: 859275.5293150685
RMSE train in validation: 1246492.7307536777 | MAE train in validation: 872079.3976027397
RMSE test in validation: 3310640.668170457 | MAE test in validation: 2264352.497260274
Time: 144.7716188430786 s


20. Sected model: RandForest_select

Parameters: {'model__n_estimators': 450, 'model__min_samples_split': 2, 'model__max_features': 'sqrt', 'model__max_depth': 25}

Performance:
NMAE (val): -2323073.358721178
RMSE train: 1248296.1226726803 | MAE train: 871530.0753454532
RMSE train in validation: 1215594.3653978498 | MAE train in validation: 850946.2297440937
RMSE test in validation: 3230903.2390529006 | MAE test in validation: 2197047.195952723
Time: 124.77998352050781 s


21. Sected model: RandForest_select_k

Parameters: {'model__max_depth': 25, 'model__max_features': 'sqrt', 'model__min_samples_split': 2, 'model__n_estimators': 450, 'select__k': 69}

Performance:
NMAE (val): -2322506.367627253
RMSE train: 1187813.5582928217 | MAE train: 831397.0666560882
RMSE train in validation: 1216037.1201242164 | MAE train in validation: 853005.4435113268
RMSE test in validation: 3225078.8210638203 | MAE test in validation: 2191218.7493181694
Time: 147.9776096343994 s
MODEL SCORES (NMAE in evaluation)
0. KNN_pred: 3239984.25
1. KNN_pred_k: 2690780.4078947366
2. KNN_select: 2880131.5631625694
3. KNN_select_k: 2603870.865432223
4. RegTrees_pred: 3467149.4407894737
5. RegTrees_pred_k: 3328832.171052632
6. RegTrees_select: 2743220.575657895
7. RegTrees_select_k: 2727416.151315789
8. LinearReg_pred: 2437056.0592061607
9. LinearReg_pred_k: 2421796.652193799
10. LinearReg_select: 2396352.0117066414
11. LinearReg_select_k: 2389586.491181177
12. DummyReg: 6953359.144736841
13. DummyReg_k: 6953359.144736841
14. SVM_pred: 6953343.117286754
15. SVM_pred_k: 6952999.553407727
16. SVM_select: 2331297.199428374
17. SVM_select_k: 2331773.9543751357
18. RandForest_pred: 2453026.6184210526
19. RandForest_pred_k: 2453026.6184210526
20. RandForest_select: 2323073.358721178
21. RandForest_select_k: 2322506.367627253
0. KNN_pred: 2518007.157534247
1. KNN_pred_k: 2171515.705479452
2. KNN_select: 0.0
3. KNN_select_k: 675.9246575342465
4. RegTrees_pred: 0.0
5. RegTrees_pred_k: 0.0
6. RegTrees_select: 2092567.191780822
7. RegTrees_select_k: 2280089.2808219176
8. LinearReg_pred: 2322380.6106294743
9. LinearReg_pred_k: 2322601.753096195
10. LinearReg_select: 2337932.2861476443
11. LinearReg_select_k: 2336194.5845998474
12. DummyReg: 6944040.205479452
13. DummyReg_k: 6944040.205479452
14. SVM_pred: 6944009.686073507
15. SVM_pred_k: 6943474.801851249
16. SVM_select: 2244918.82901025
17. SVM_select_k: 2272265.7630818374
18. RandForest_pred: 871871.4328767123
19. RandForest_pred_k: 872079.3976027397
20. RandForest_select: 850946.2297440937
21. RandForest_select_k: 853005.4435113268
0. KNN_pred: 2892257.01369863
1. KNN_pred_k: 2560118.5479452056
2. KNN_select: 2587777.1287017944
3. KNN_select_k: 2483096.382277287
4. RegTrees_pred: 3406755.205479452
5. RegTrees_pred_k: 3460965.616438356
6. RegTrees_select: 2655352.602739726
7. RegTrees_select_k: 2710957.602739726
8. LinearReg_pred: 2265683.802964292
9. LinearReg_pred_k: 2263068.4012916926
10. LinearReg_select: 2260087.8287112545
11. LinearReg_select_k: 2258218.4050652594
12. DummyReg: 6720947.2602739725
13. DummyReg_k: 6720947.2602739725
14. SVM_pred: 6720917.536373118
15. SVM_pred_k: 6720375.017111049
16. SVM_select: 2328968.7736492744
17. SVM_select_k: 2374500.7294935877
18. RandForest_pred: 2268131.293150685
19. RandForest_pred_k: 2264352.497260274
20. RandForest_select: 2197047.195952723
21. RandForest_select_k: 2191218.7493181694
0. KNN_pred: 4.260819673538208
1. KNN_pred_k: 8.521639347076416
2. KNN_select: 6.565516233444214
3. KNN_select_k: 11.211513996124268
4. RegTrees_pred: 0.5705435276031494
5. RegTrees_pred_k: 3.9039855003356934
6. RegTrees_select: 16.35970973968506
7. RegTrees_select_k: 44.26800060272217
8. LinearReg_pred: 0.2956578731536865
9. LinearReg_pred_k: 2.5743250846862793
10. LinearReg_select: 4.527256488800049
11. LinearReg_select_k: 6.653132915496826
12. DummyReg: 0.23421549797058105
13. DummyReg_k: 1.9249184131622314
14. SVM_pred: 2.1233932971954346
15. SVM_pred_k: 29.39649200439453
16. SVM_select: 165.52876663208008
17. SVM_select_k: 38.443318367004395
18. RandForest_pred: 22.279118299484253
19. RandForest_pred_k: 144.7716188430786
20. RandForest_select: 124.77998352050781
21. RandForest_select_k: 147.9776096343994
Accumulated approximated real times
<Figure size 1000x350 with 0 Axes>

As it will be discussed later in section 8.1.1, although the SVM has a sligthly better -NMAE scoring, when in validation test, the best model by far is the Random Forests with selection of attributes and selection of parameters.

Timewise, as we argued before, is not relevant for us since the training is probably a one-time process, and the prediction is the one that is going to be used in the real world for a long time period. If not, as we will see later, we would choose the Random Forests with selection of attributes which offers a better performance in terms of MAE and RMSE.

Ultimately, if the time is a critical value for the client, we will choose the Linear Regression (already discussed the different models in section 5.5), as it is blazingly fast.

8.1. Best Model Selection¶

After carefully evaluating the results and considering various factors, we have come to the conclusion that Random Forest is the optimal choice for our model. Apart from outperforming SVM in terms of MAE and RMSE measurements in the 5th fold test-validation, Random Forest also offers several advantages.

One key advantage is its ability to handle non-linear relationships in the data. Random Forest employs a decision tree ensemble approach, which allows for capturing complex interactions and patterns in the data, making it a suitable choice for our dataset that may contain non-linear relationships between variables.

Additionally, Random Forest is known for its robustness to outliers and noise in the data. It is less sensitive to noisy data points compared to SVM, which can be especially beneficial when dealing with real-world datasets that often contain noise or outliers.

Furthermore, Random Forest is a highly scalable algorithm that can efficiently handle large datasets, making it suitable for our computational capabilities. On the other hand, SVM can be computationally intensive, especially with larger datasets and higher values of C, which may not be feasible in our current computational setup.
Although SVM has the potential to improve with an increase in C, we have weighed the trade-off between computation time and scoring results, and determined that Random Forest provides a favorable balance for our specific needs. A similar trade-off happens with Random Forest and the number of trees (estimators), but not that drastically.
The problem is that both C for SVM and estimators for Random Forest make the model better infinitely, but the computational cost is not linear, so the minimal (almost negligible) gain in performance is not worth the computational cost.

After considering both the -NMAE in validation and the test results in the fifth fold of validation, we have decided to use Random Forest as our preferred model. This is because the validation test allows us to assess how promisingly the model will perform in the actual test. Although SVM has slightly better performance in terms of -NMAE scoring, the marginal gain is not significant enough to outweigh its much worse performance (although good) in the validation test compared to Random Forest.

In conclusion, considering its superior performance in our validation tests, ability to handle non-linear relationships, robustness to noise, scalability, and computational efficiency, we have decided to select Random Forest as our preferred model for this particular project.

8.1.1. Best Model Prediction - Test¶

In [174]:
# The selected model is RandForest_select_k
sel_model = models["RandForest_select_k"]

# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train)  # We al ready did the 5th fold split at the begginning

# We obtain the different scores of the model
score = train_validation_test(
    model,
    model.best_estimator_,
    model.best_score_,
    X_train,
    y_train,
    True,
    X_test,
    y_test,
)
Results of the best estimator of Pipeline
NMAE in validation: -2489180.03
RMSE train: 2071531.63 | MAE train: 1285734.74
RMSE test: 3071048.88 | MAE test: 2111167.90
RMSE validation train: 1216037.12 | MAE validation train: 853005.44
RMSE validation test: 3225078.82 | MAE validation test: 2191218.75

As we hypothesized before, the test scores are better than the validation scores (both scoring timeseries validation and test validation), which is a good sign and expected as they were an understimation. Overall, the score is very good, with a result of: RMSE test: 3071048.88 | MAE test: 2111167.90

8.2. Selected Model Training¶

Once selected the best model, we will train in with all the data we have available, and then we will use it to predict the values of the competition dataset.

First, we divide the whole dataset into the training set (inputs, X, and outputs, y). Then, we train the model with the whole dataset, and to be predicted, the model should perform better than the one we selected first, as it has more training data.

In [187]:
X_train = disp_df.drop("salida", axis=1)  # This is the input features for training
y_train = disp_df["salida"]  # This is the target variable for training

print("Data shape: ", disp_df.shape)
print("X_train shape: ", X_train.shape)
print("y_train shape: ", y_train.shape)
Data shape:  (4380, 76)
X_train shape:  (4380, 75)
y_train shape:  (4380,)
In [188]:
# We will use the whole dataset to train the model - disp_df
np.random.seed(10)
budget = 100
n_splits = 5

pipeline = Pipeline(
    [
        ("select", SelectKBest(f_regression)),
        ("model", RandomForestRegressor(random_state=10))
    ]
)

param_grid = {
    "model__n_estimators": [100, 300, 350, 400, 450], #  500, 600, 700, 900, 10000 -> too slow for the minimal improvements they offer in the scoring (not even perceptible) - 450 still makes a decent improvement
    "model__max_depth": list(range(5, 36, 5)),
    "model__min_samples_split": [2, 3, 4, 5],
    "model__max_features": ["sqrt"], # log2 does not offer as good results
    "select__k": list(range(1, X_train.shape[1])),
}

model = RandomizedSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=TimeSeriesSplit(n_splits),
    n_iter=budget,
    n_jobs=-1,
)

start_time = time.time()
model.fit(X_train, y_train)
end_time = time.time()

total_time = end_time - start_time

# We calculate the subsets used for training and testing in the different folds of the cross-validation
# validation_splits(model, X_train)  # We al ready did the 5th fold split at the begginning

# We obtain the different scores of the model
score = train_validation_test(
    model,
    model.best_estimator_,
    model.best_score_,
    X_train,
    y_train,
)

models["final_model"] = model
results["final_model"] = score
times["final_model"] = total_time

print_results("Random Forest (Final model)", model, score, total_time)
Results of the best estimator of Pipeline
NMAE in validation: -2324597.85
RMSE train: 1249422.71 | MAE train: 868956.42
RMSE validation train: 1280013.14 | MAE validation train: 889329.11
RMSE validation test: 3218699.57 | MAE validation test: 2189210.81
---------------------------------------------------
Random Forest (Final model) best model is:

RandomizedSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
                   estimator=Pipeline(steps=[('select',
                                              SelectKBest(score_func=<function f_regression at 0x7f817b0e3f40>)),
                                             ('model',
                                              RandomForestRegressor(random_state=10))]),
                   n_iter=100, n_jobs=-1,
                   param_distributions={'model__max_depth': [5, 10, 15, 20, 25,
                                                             30, 35],
                                        'model__max_features': ['sqrt'],
                                        'model__min_samples_split': [2, 3, 4,
                                                                     5],
                                        'model__n_estimators': [100, 300, 350,
                                                                400, 450],
                                        'select__k': [1, 2, 3, 4, 5, 6, 7, 8, 9,
                                                      10, 11, 12, 13, 14, 15,
                                                      16, 17, 18, 19, 20, 21,
                                                      22, 23, 24, 25, 26, 27,
                                                      28, 29, 30, ...]},
                   scoring='neg_mean_absolute_error')

Parameters: {'select__k': 70, 'model__n_estimators': 300, 'model__min_samples_split': 3, 'model__max_features': 'sqrt', 'model__max_depth': 30}

Performance: NMAE (val): -2324597.848848228 | RMSE train: 1249422.7076020176 | MAE train: 868956.4214434926 | RMSE train in validation: 1280013.139932164 | MAE train in validation: 889329.1075172715 | RMSE test in validation: 3218699.573567223 | MAE test in validation: 2189210.8078122223
Execution time: 117.42434859275818s

To be observed, just as before, the scoring -NMAE is not as good as the test evaluation partition (which is a good indicator of the performance of the model with the competition dataset), but it is still a good indicator of the performance of the model: NMAE in validation: -2324597.85 | RMSE validation test: 3218699.57 | MAE validation test: 2189210.81

Note how the results are similar (better) to the ones predicted in the validation fold scoring, validation test, and test previously.

8.2.1. Selected Model Prediction and Comparison¶

The bad thing about using the whole dataset for training is that we don't have any data left for testing the model's performance. Without a separate set of data for testing, we cannot accurately evaluate how well the model generalizes to unseen data.

To address this issue, we have implemented a function that calculates the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) of the model on the fifth fold of the train-validation splits. This allows us to obtain an estimate of the model's performance on the most trained fold, which can serve as an indication of how well the model is likely to perform in the near future.

In [191]:
plt.rcParams['figure.figsize'] = [10, 3.5]

print("MODEL SCORES (NMAE in evaluation)")
iter = 0
for key, value in results.items():
    plt.bar(key, abs(value[0]))
    print(f"{iter}. {key}: {abs(value[0])}")
    iter += 1
plt.title("Score")
plt.xlabel("Model")
plt.ylabel("NMAE scoring in validation")
plt.tight_layout()

plt.xticks(rotation=45, ha='right', size=7)

# Exporting image as png to ../data/img folder
plt.savefig("../data/img/best_methods_score.png")
plt.show()


print("MODEL SCORES (NMAE in evaluation)")
iter = 0
for key, value in results.items():
    plt.bar(key, abs(value[6]))
    print(f"{iter}. {key}: {abs(value[6])}")
    iter += 1
plt.title("Score")
plt.xlabel("Model")
plt.ylabel("NMAE scoring in validation")
plt.tight_layout()

plt.xticks(rotation=45, ha='right', size=7)

# Exporting image as png to ../data/img folder
plt.savefig("../data/img/best_methods_score.png")
plt.show()
MODEL SCORES (NMAE in evaluation)
0. KNN_pred: 3239984.25
1. KNN_pred_k: 2690780.4078947366
2. KNN_select: 2880131.5631625694
3. KNN_select_k: 2603870.865432223
4. RegTrees_pred: 3467149.4407894737
5. RegTrees_pred_k: 3328832.171052632
6. RegTrees_select: 2743220.575657895
7. RegTrees_select_k: 2727416.151315789
8. LinearReg_pred: 2437056.0592061607
9. LinearReg_pred_k: 2421796.652193799
10. LinearReg_select: 2396352.0117066414
11. LinearReg_select_k: 2389586.491181177
12. DummyReg: 6953359.144736841
13. DummyReg_k: 6953359.144736841
14. SVM_pred: 6953343.117286754
15. SVM_pred_k: 6952999.553407727
16. SVM_select: 2331297.199428374
17. SVM_select_k: 2331773.9543751357
18. RandForest_pred: 2453026.6184210526
19. RandForest_pred_k: 2453026.6184210526
20. RandForest_select: 2323073.358721178
21. RandForest_select_k: 2322506.367627253
22. final_model: 2324597.848848228
MODEL SCORES (NMAE in evaluation)
0. KNN_pred: 2892257.01369863
1. KNN_pred_k: 2560118.5479452056
2. KNN_select: 2587777.1287017944
3. KNN_select_k: 2483096.382277287
4. RegTrees_pred: 3406755.205479452
5. RegTrees_pred_k: 3460965.616438356
6. RegTrees_select: 2655352.602739726
7. RegTrees_select_k: 2710957.602739726
8. LinearReg_pred: 2265683.802964292
9. LinearReg_pred_k: 2263068.4012916926
10. LinearReg_select: 2260087.8287112545
11. LinearReg_select_k: 2258218.4050652594
12. DummyReg: 6720947.2602739725
13. DummyReg_k: 6720947.2602739725
14. SVM_pred: 6720917.536373118
15. SVM_pred_k: 6720375.017111049
16. SVM_select: 2328968.7736492744
17. SVM_select_k: 2374500.7294935877
18. RandForest_pred: 2268131.293150685
19. RandForest_pred_k: 2264352.497260274
20. RandForest_select: 2197047.195952723
21. RandForest_select_k: 2191218.7493181694
22. final_model: 2189210.8078122223

As mentioned before, the results regarding scoring of the final model are the best overall.

8.3. Selected Model Export¶

In [194]:
import pickle

print(models["final_model"].best_params_)

selected_model = models["final_model"]

print(f"\nSelected model: {selected_model}, {type(selected_model)}, {selected_model.best_params_}")

# Export model as pickle file in ../data/model folder
with open("../data/model/modelo_final.pkl", "wb") as file:
    pickle.dump(selected_model, file)

# ! Compare the model exported with the one loaded - check if it is the same
# Load model from pickle file
with open("../data/model/modelo_final.pkl", "rb") as file:
    loaded_model = pickle.load(file)

print(f"\nSaved model: {loaded_model}, {type(loaded_model)}, {loaded_model.best_params_}")
    
if selected_model.best_params_ == loaded_model.best_params_:
    print("\n\nThe models has been saved and loaded correctly")
else:
    print("\n\nERROR: The models are different")
{'select__k': 70, 'model__n_estimators': 300, 'model__min_samples_split': 3, 'model__max_features': 'sqrt', 'model__max_depth': 30}

Selected model: RandomizedSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
                   estimator=Pipeline(steps=[('select',
                                              SelectKBest(score_func=<function f_regression at 0x7f817b0e3f40>)),
                                             ('model',
                                              RandomForestRegressor(random_state=10))]),
                   n_iter=100, n_jobs=-1,
                   param_distributions={'model__max_depth': [5, 10, 15, 20, 25,
                                                             30, 35],
                                        'model__max_features': ['sqrt'],
                                        'model__min_samples_split': [2, 3, 4,
                                                                     5],
                                        'model__n_estimators': [100, 300, 350,
                                                                400, 450],
                                        'select__k': [1, 2, 3, 4, 5, 6, 7, 8, 9,
                                                      10, 11, 12, 13, 14, 15,
                                                      16, 17, 18, 19, 20, 21,
                                                      22, 23, 24, 25, 26, 27,
                                                      28, 29, 30, ...]},
                   scoring='neg_mean_absolute_error'), <class 'sklearn.model_selection._search.RandomizedSearchCV'>, {'select__k': 70, 'model__n_estimators': 300, 'model__min_samples_split': 3, 'model__max_features': 'sqrt', 'model__max_depth': 30}

Saved model: RandomizedSearchCV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None),
                   estimator=Pipeline(steps=[('select',
                                              SelectKBest(score_func=<function f_regression at 0x7f817b0e3f40>)),
                                             ('model',
                                              RandomForestRegressor(random_state=10))]),
                   n_iter=100, n_jobs=-1,
                   param_distributions={'model__max_depth': [5, 10, 15, 20, 25,
                                                             30, 35],
                                        'model__max_features': ['sqrt'],
                                        'model__min_samples_split': [2, 3, 4,
                                                                     5],
                                        'model__n_estimators': [100, 300, 350,
                                                                400, 450],
                                        'select__k': [1, 2, 3, 4, 5, 6, 7, 8, 9,
                                                      10, 11, 12, 13, 14, 15,
                                                      16, 17, 18, 19, 20, 21,
                                                      22, 23, 24, 25, 26, 27,
                                                      28, 29, 30, ...]},
                   scoring='neg_mean_absolute_error'), <class 'sklearn.model_selection._search.RandomizedSearchCV'>, {'select__k': 70, 'model__n_estimators': 300, 'model__min_samples_split': 3, 'model__max_features': 'sqrt', 'model__max_depth': 30}


The models has been saved and loaded correctly

9. Final Conclusions¶

During this project, we have had the opportunity to gain a deeper understanding of the model selection process. We began with exploratory data analysis (EDA), which helped us to improve our understanding and management of the data. We found this to be an extremely useful tool throughout the entire project. We believe that this part of the project should be evaluated with greater emphasis, as it is the foundation upon which all of our decisions were based.

Next, we created and trained all of our models, gaining experience in the use of pipelines and a deeper understanding of the importance of hyperparameters. Finally, we analyzed the different results provided by each model, gaining a better understanding of their respective advantages and disadvantages in terms of scoring and time.

We believe that this project is an excellent complement to the main lessons, as it provides a deeper understanding of the subject matter.


X. Output the Jupyter Notebook as an HTML file¶

In [196]:
import os

# Export the notebook to HTML
os.system("jupyter nbconvert --to html model.ipynb --output ../data/html/model.html")
print("Notebook exported to HTML")
[NbConvertApp] Converting notebook model.ipynb to html
Notebook exported to HTML
[NbConvertApp] Writing 16124216 bytes to ../data/html/model.html